PREFACE


It  is  the  aim  of  this  book  to  present  the  fundamental  notions 
of statistical analysis in such a manner that they can be comprehended
by students who have had but little training in mathematics
and  yet  in  such  a  way  that  they  can  be  studied  to  advantage  even 
by  those  who  have  had  considerable  mathematics.  To  supplement 
the  mathematical  preparation  of  the  former  group  we  have  inter- 
mittently interrupted  the  continuity  of  the  statistical  procedure  by 
inserting  sections  on  certain  topics  of  advanced  algebra  and  analytic 
geometry  such  as  sums  and  summations,  some  properties  of  the 
straight  line,  permutations,  combinations,  and  the  elementary  theory 
of  probability. 

Many  of  the  basic  notions  of  statistical  analysis  are  expressed  by 
formulas,  the  derivations  of  which  have  been  assumed  —  altogether 
too  frequently  —  to  be  hidden  in  a  maze  of  higher  mathematics. 
For a number of years we have encountered a growing opinion in
some  circles  —  betrayed  by  clever  innuendo  and  subtle  insinuation 
when  not  definitely  expressed  —  that  how  to  use  a  formula  and  what 
it  means  are  the  primary  desiderata  in  statistical  analysis  and  that 
how  it  is  derived  and  what  are  its  limitations  are  of  secondary  im- 
portance. It  is  our  conviction  that  a  reader  will  not  comprehend 
fully  what  a  formula  means  and  what  are  its  limitations  unless  he 
knows  whence  it  comes  and  what  are  the  assumptions  underlying 
its  development. 

Since the mathematical attainments necessary for an understanding
of the development of many of our basic formulas include no
more  than  a  knowledge  of  algebra  through  the  binomial  theorem, 
the  theory  and  use  of  logarithms,  and  the  progressions  —  topics  that 
are  included  in  a  well  organized  course  of  secondary  algebra  —  we 
have  included  many  derivations  that  come  within  the  grasp  of  the 
ordinary  student.  The  limited  preparation  in  mathematics  that  we 
assume  on  the  part  of  our  readers  requires  that  difficult  derivations 
be  generally  omitted. 

While the theory of statistical analysis is not easy, yet the difficulties
are, in the main, due to the newness rather than to the abstruseness
of the notions encountered. The concepts will, therefore,
become  more  meaningful  and  less  terrifying  if  the  student  will  be 
required  to  solve  many  of  the  numerous  exercises  that  have  been 
provided  in  the  text. 

Statistical  analysis  boils  down  ultimately  to  numerical  results: 
the  methods  and  processes  used  in  obtaining  them  and  the  methods 
and  means  for  estimating  their  reliability.  The  earlier  chapters  of 
the  book  are  concerned  mainly  with  the  methods,  processes,  and  forms 
used  in  obtaining  numerical  results  and  the  later  chapters  deal  with 
estimating  their  reliability. 

The  plan  used  in  the  development  of  the  text  may  be  briefly  de- 
scribed as  follows:  Each  topic  is  introduced  with  a  brief  statement  of 
"what it is all about." Then follows a brief statement of the underlying
theory of the topic under consideration which leads directly
and  simply  to  a  development  of  the  necessary  formulas  and  processes. 
The  reader  is  then  shown  how  to  use  the  formulas  and  processes  to 
obtain  the  desired  numerical  results.  Finally,  the  limitations  of 
the  formulas  and  processes  and  the  significance  and  the  reliability 
of  the  computed  results  are  given  due  emphasis.  Thus  a  student 
learns  why  a  formula  is  applied,  whence  it  is  derived,  how  it  is  used  and 
what  are  its  limitations;  he  learns  not  only  how  to  obtain  the  numeri- 
cal results  but  also  how  to  measure  their  reliability. 

The  method  of  treatment  of  all  the  topics  is  decidedly  elementary. 
The  graphical  method  has  been  widely  employed  and  the  explana- 
tions have  been  purposely  detailed  in  order  that  the  book  may  be 
more  readily  understood.  Since  the  book  undertakes  to  develop 
skills  in  deriving  statistical  results  as  well  as  to  assist  in  understanding 
their  significance,  numerous  exercises  have  been  placed  at  strategic 
points  in  the  text.  This  feature  of  solving  exercises  after  a  major 
topic  has  been  considered  adds  to  the  teachableness  of  the  subject, 
facilitates  an  understanding  of  the  principles,  and  aids  the  student  in 
acquiring  the  useful  skills  for  statistical  computations  and  inter- 
pretation. 

In general, the exercises are based upon actual rather than imaginary
data in order that the study may proceed, if possible, with real
life situations. The alert teacher can improvise "homemade"
exercises as he needs them. Throughout the text, it is supposed that
a  computing  machine  is  at  the  disposal  of  the  student;  nevertheless 
many  of  the  exercises  can  be  done  satisfactorily  with  a  slide  rule  or 
a  table  of  logarithms,  powers,  and  roots. 

No  attempt  has  been  made  to  make  this  text  an  exhaustive  treatise 
on  statistical  analysis.  Many  topics,  such  as  multiple  correlation 
and  frequency  curves,  have  been  studiously  omitted.  We  have  tried 
to  keep  in  mind  that  we  are  writing  an  Introduction  that  would 
include  the  minimum  essentials,  at  the  same  time  hoping  that  this 
Introduction  might  inspire  the  reader  to  continue  his  study  into  the 
more  advanced  fields. 

I  wish  at  this  time  to  renew  my  thanks  to  Professor  James  W. 
Glover  and  Professor  Harry  C.  Carver  of  the  University  of  Michigan 
for  their  most  generous  aid  to  me  when  I  was  under  their  instruction. 
I  also  hasten  to  express  my  gratitude  to  Professor  C.  H.  Forsyth  of 
Dartmouth  College  and  Professor  Ralph  W.  Tyler  of  Ohio  State  Uni- 
versity, who  critically  read  the  manuscript  and  made  numerous 
helpful  suggestions. 

For  any  errors,  I  alone  am  responsible.  Although  the  text  has  been 
checked  painstakingly,  it  is  not  to  be  hoped  that  a  publication  of  this 
character  will  appear  without  some  errors  creeping  in.  For  the 
notification  of  such  errors  I  shall  be  most  grateful. 

PREFACE  TO  THE  ENLARGED  EDITION 

It  has  been  an  unexpected  gratification  to  the  author  and  the 
publishers  alike  to  find  that  an  enlarged  edition  of  the  book  is  called 
for  so  soon  after  its  publication.  The  only  criticism  of  consequence 
that  has  been  made  of  the  book  was  due  to  our  omission  of  Index 
Numbers.  After  sampling  the  opinion  of  many  teachers,  it  was  felt 
desirable  to  add  a  chapter  devoted  to  that  topic.  The  opportunity 
has  been  taken  to  recast  certain  paragraphs,  and  such  errors  as  were 
noted  have  been  corrected.  To  all  who  have  assisted  me  with  their 
suggestions  or  by  directing  my  attention  to  errors,  I  wish  to  express 
my  sincere  gratitude. 

C.  H.  Richardson 

Lewisburg,  Pennsylvania 
April  30,  1935 


PREFACE  TO  THE  REVISED  EDITION 

About  ten  years  have  elapsed  since  the  first  edition  of  this  book 
was  published  and  twelve  to  fifteen  years  since  the  material  for  the 
first  edition  was  collected  and  prepared.  During  this  time  a  tre- 
mendous appreciation  of  and  respect  for  statistical  techniques  have 
developed.  A  considerable  extension  of  the  use  of  statistical  tech- 
niques in  business,  in  public  administration,  and  in  the  social  sciences 
is  very  much  in  evidence.  Research  workers  in  biology,  in  education, 
in  psychology,  in  sociology,  in  agriculture,  lean  more  heavily  on 
statistical  techniques  than  ever  before.  And,  with  the  passing  of 
time,  there  has  come  a  demand  for  more  than  primer  notions:  a 
deeper  understanding  of  basic  ideas  is  mandatory.  For  example,  it 
no  longer  suffices  merely  to  compute  a  statistical  constant  or  statistic: 
one  must  evaluate  it,  determine  its  worth. 

Notable  gains  have  been  made  during  the  past  decade  in  the  de- 
velopment of  new  and  in  the  improvement  of  old  techniques.  En- 
riching the  old  areas  and  exploring  new  ones  have  challenged  some  of 
the  best  minds  of  the  world.  Creative  minds  in  pure  as  well  as  in 
applied mathematics have attacked fundamental problems so vigorously
that now the literature of the field is colossal.

Having  been  alert  to  these  new  developments  and  improvements, 
it  is  our  wish  to  incorporate  those  that  are  appropriate  into  this  new 
edition.  In  doing  this  we  have  sought  to  retain  the  main  features  of 
the  first  edition  since  the  plan  of  its  construction  has  met  the  approval 
of  a  wide  audience  of  teachers  and  students.  The  two  objectives, 
statistical  description  and  statistical  evaluation,  have  been  kept  in 
mind.  In  this  edition  we  are  not  giving  less  attention  to  statistical 
description  but  we  have  been  careful  to  give  more  emphasis  to  statis- 
tical evaluation  and  statistical  induction.  It  is  essential  that  the 
student  be  able  to  compute  a  statistic:  it  is  just  as  essential  that  he 
know  what  he  has  when  he  has  it,  and  to  know,  in  terms  of  proba- 
bility, what  he  can  do  with  it.  Consequently,  we  have  made  a  great 
effort  to  make  the  techniques  and  computations  meaningful.  At 
the  risk  of  being  prolix,  we  have  given  rather  full  verbal  discussions 
of  important  matters;  our  illustrative  examples  are  numerous  and 
their  solutions  detailed. 

Along with the progress that has been made in improving old
techniques  and  in  establishing  new  ones,  there  has  come  an  enlarged 
opportunity  for  the  study  of  statistical  analysis  by  more  and  more 
students  of  our  colleges.  Due  to  its  wide  application  a  knowledge 
of statistical methods is now a "must" in a program for a liberal
education.  Of  course  this  growth  has  been  influenced  greatly  by  the 
desire  of  thinkers  to  replace  as  far  as  possible  the  subjective  elements 
of  their  fields  by  objective  procedures.  On  the  whole,  this  substitu- 
tion of  objectivity  for  mere  opinion  has  been  healthful. 

The  thirteen  chapters  of  this  edition  fall  into  two  divisions,  each 
division  associated  with  a  definite  objective.  The  first  ten  chapters 
emphasize  statistical  description  whereas  the  last  three  chapters  em- 
phasize statistical  induction.  A  study  of  the  entire  book  is  con- 
sequently necessary  if  one  would  seek  an  understanding  of  what  is 
now  considered  to  be  the  essentials  of  elementary  statistics,  statistical 
description  and  statistical  induction. 

One  new  chapter,  Multiple  Correlation,  has  been  added  to  the 
present  edition.  New  sections  pertaining  to  other  topics  have  been 
inserted.  Many  sections  have  been  completely  rewritten,  others 
greatly  amplified.  The  numerical  exercises  have  been  multiplied 
and  the  algebra  of  statistics  has  been  extended.  The  book  has  there- 
fore not  only  been  revised  but  greatly  enlarged,  thus  providing  a 
wider  selection  of  topics  for  the  teachers. 

Many  friends  and  teachers  have  rendered  invaluable  assistance 
with  their  sympathetic  suggestions  for  the  improvement  of  the  book. 
These  suggestions  have  come  to  me  over  the  years.  I  wish  that  I 
might  mention  here  each  contributor  personally  but  the  list  is  too 
long.  However,  I  do  want  to  again  express  my  thanks  to  my  friend 
and  former  teacher,  Professor  A.  R.  Crathorne  of  the  University  of 
Illinois,  whose  generous  and  tactful  suggestions  have  been  invaluable. 
Also,  I  want  to  express  my  thanks  to  my  colleague,  Mr.  Paul  Benson, 
who  has  assisted  with  the  proof  and  has  made  numerous  helpful  sug- 
gestions.   Of  course  for  any  errors,  I  alone  am  responsible. 

In  this  edition  I  am  including  answers  to  many  of  the  exercises. 
Obviously,  it  is  too  much  to  expect  that  all  of  them  are  correct.  For 
the  notification  of  any  errors  I  shall  be  very  grateful. 


Lewisburg,  Pennsylvania 
September  1,  1943 


C.  H.  Richardson 

INTRODUCTION

1.  THE  MEANING  AND  IMPORTANCE 

OF  STATISTICS 

During the last half-century,¹ the thinking world seems to have
awakened  to  an  unusually  deep  appreciation  of  and  respect  for 
numerical  facts.  Even  the  untrained  mind  has  confidence  in  a 
conclusion  stated  in  numerical  language  and  supported  by  numerical 
facts.  Whether  the  affairs  are  of  state  or  laboratory,  we  must  have 
observed  that  quantitative  facts  concerning  them  are  collected  in 
boundless profusion. The social and biological sciences, which were
qualitative a few decades ago, have now become largely quantitative.
Masses of numerical data are collected by individuals, by
corporations,  by  governments. 

These  masses  of  data,  numerical  facts,  measurements,  which  are 
generally known as statistics, may more precisely be called statistical
data. The special methods used in the explanation and the elucidation
of quantitative data may be fittingly called statistical methods.
The  analysis  which  is  peculiar  to  and  forms  the  basis  of  our  method 
we  call  statistical  analysis. 

The  word  statistics  is  generally  used  indiscriminately  in  two  dif- 
ferent senses:  on  the  one  hand  to  refer  to  statistical  material,  the 
group  of  numerical  data;  and  on  the  other  hand,  to  statistical 
analysis,  which  includes  those  technical  operations  that  have  to  do 
with  the  explanation  and  the  interpretation  of  the  numerical  data. 

As we shall use the term, statistics is the science which deals with
the collection, the organization, the analysis, and the interpretation of
masses of numerical facts. It will be noted that this definition is
broader in scope than that given by Yule and Kendall. They say,²

¹ This is not meant to imply that statistics is a new subject. See the Book of
Numbers  in  the  Bible.  See  also  H.  M.  Walker,  Studies  in  the  History  of  Statistical 
Method,  1929. 

² G. U. Yule and M. G. Kendall, Introduction to the Theory of Statistics,
12th ed., p. 3.

By statistics we mean quantitative data affected to a marked extent
by  a  multiplicity  of  causes. 

By  statistical  methods  we  mean  methods  specially  adapted  to  the  elu- 
cidation of  quantitative  data  affected  by  a  multiplicity  of  causes. 

By  theory  of  statistics  we  mean  the  exposition  of  statistical  methods. 

Statistical  methods  are  fundamentally  the  same  whether  employed 
in  the  analysis  of  physical  phenomena,  the  study  of  educational 
measurements,  the  records  of  biological  experiment,  or  the  analysis 
of quantitative material in economics. All such data are "affected
to a marked extent by a multiplicity of causes." True, the physicist,
the chemist, the biologist, and possibly the psychologist attempt to
eliminate many disturbing causes and to concentrate their attention
upon one or two most powerful influences affecting their phenomena,
yet many disturbances are always present. However, the
same  general  procedure  is  followed  by  the  educationist  and  the 
economist.  Generally,  it  is  one  of  continued  summarization. 

We  shall  therefore  feel  free  to  apply  our  analysis  to  numerical  data 
whether  they  come  from  the  astronomer  or  the  agriculturist,  the 
physicist  or  the  economist,  the  biologist  or  the  chemist.  Wherever 
there  is  a  mass  of  numerical  data  that  admits  of  explanation,  we  shall 
consider  its  analysis  our  field  of  endeavor. 

The  fact  that  the  human  mind  is  incapable  of  comprehending  a 
large  number  of  impressions  at  one  time  is  generally  recognized.  A 
mass  of  numerical  data  is  an  appropriate  illustration.  To  grasp  the 
meaning  of  a  mass  of  numerical  data  we  must  reduce  its  bulk.  The 
organization  of  the  data  is  the  first  step  in  the  summarizing  process. 
It  is  a  phrase  that  is  used  to  describe  the  process  of  arranging  the 
data  in  a  compact  form  that  facilitates  computations  and  comparison. 
When they are so arranged, ordered, classified (in short, organized), they
are  then  in  a  form  suitable  for  the  analysis. 

The  process  of  abstracting  the  significant  facts  contained  in  a  mass 
of  numerical  data  and  making  clear  and  concise  statements  about  the 
derived  results  constitutes  a  statistical  analysis  of  the  data.  A 
statistical  analysis,  therefore,  enables  us  to  express  the  relevant 
information contained in the mass of data by means of a few numerical
values known as statistical constants or parameters, each constant
describing an important property of the mass of data. It is thus the
purpose of statistical analysis to give a summarized and comprehensible
numerical description of masses of numerical data. This is
effected  by  computing  a  few  constants  pertaining  to  the  data  and 
understanding  their  meaning. 

Toward  this  numerical  description  of  the  mass  of  data  we  may 
adopt  two  points  of  view.  We  may  view  the  description  of  the  given 
mass  as  an  end  in  itself,  or  we  may  view  it  as  a  basis  for  generaliza- 
tion, as  a  basis  for  making  estimates  of  the  character  measured 
pertaining  to  a  larger  group.  The  smaller  group  that  is  analyzed 
we call a sample; the larger group about which we make estimates
we call the parent population or universe. The interpretation of the
data  is  a  phrase  used  when  we  adopt  the  larger  point  of  view  and 
make  estimates,  form  judgments,  or  draw  inferences  of  the  universe 
from  a  study  of  the  statistical  properties  of  the  sample. 

Let  us  consider,  for  example,  the  scores  of  100  freshmen  at  Buck- 
nell  University  on  a  standardized  Algebra  test  on  which  the  highest 
attainable  score  was  50.   Here  are  the  scores  in  Table  1. 

Table 1. Scores of 100 Freshmen on an Algebra Test

43  18  25  18  39  44  19  20  20  26
40  45  38  25  13  14  27  41  42  17
34  31  32  27  33  37  25  26  32  25
33  34  35  46  29  24  31  34  35  24
28  30  41  32  29  28  30  31  30  31
28  31  30  34  40  29  46  30  30  47
31  35  36  29  26  32  36  35  36  37
32  23  22  29  33  37  33  27  24  36
23  42  29  37  19  23  44  41  45  39
21  21  42  22  28  38  15  16  17  28

If  we  desire  such  information  regarding  the  above  scores  as  is 
found  in  the  answers  to  the  following  questions,  we  must  look  through 
the  entire  table. 

1.  How  many  students  obtained  scores  greater  than  43? 

2.  How  many  obtained  scores  greater  than  22  and  less  than  43? 

3.  How  many  obtained  scores  less  than  23? 

4.  What  is  the  lower  boundary  of  the  upper  20%  of  the  scores? 

5. What is the lower boundary of the upper 40% of the scores?

Such  questions  may  be  readily  answered  if  we  organize  the  data 
by  arranging  the  scores  into  classes,  as  we  have  done  in  Table  2. 

Table 2. Organization of the Data of Table 1

Class         Frequency f, or the number of
              scores in the class

42.5-47.5       8
37.5-42.5      12
32.5-37.5      20
27.5-32.5      28
22.5-27.5      16
17.5-22.5      10
12.5-17.5       6

Total         100

The  new  table  is  called  a  frequency  table  for  it  gives  the  frequency 
(the  number  of  scores)  in  the  respective  intervals.  Evidently  the 
organization  of  the  data  presents  them  in  a  form  that  is  more  suitable 
for  statistical  purposes  than  the  disorganized  form  of  Table  1  does. 
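
The passage from Table 1 to Table 2 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the names `scores`, `frequency_table`, and the three constants are ours, and the scores are entered in the order in which they are printed in Table 1.

```python
from collections import Counter

# The 100 scores of Table 1, in the order printed.
scores = [
    43, 18, 25, 18, 39, 44, 19, 20, 20, 26,
    40, 45, 38, 25, 13, 14, 27, 41, 42, 17,
    34, 31, 32, 27, 33, 37, 25, 26, 32, 25,
    33, 34, 35, 46, 29, 24, 31, 34, 35, 24,
    28, 30, 41, 32, 29, 28, 30, 31, 30, 31,
    28, 31, 30, 34, 40, 29, 46, 30, 30, 47,
    31, 35, 36, 29, 26, 32, 36, 35, 36, 37,
    32, 23, 22, 29, 33, 37, 33, 27, 24, 36,
    23, 42, 29, 37, 19, 23, 44, 41, 45, 39,
    21, 21, 42, 22, 28, 38, 15, 16, 17, 28,
]

# Class boundaries of Table 2: 12.5-17.5, 17.5-22.5, ..., 42.5-47.5.
LOWEST_BOUNDARY = 12.5
CLASS_WIDTH = 5
NUM_CLASSES = 7


def frequency_table(values):
    """Return ((lower, upper), frequency) pairs, highest class first."""
    # Index 0 is the class 12.5-17.5, index 1 the class 17.5-22.5, and so on.
    counts = Counter(int((v - LOWEST_BOUNDARY) // CLASS_WIDTH) for v in values)
    table = []
    for k in reversed(range(NUM_CLASSES)):
        lower = LOWEST_BOUNDARY + k * CLASS_WIDTH
        table.append(((lower, lower + CLASS_WIDTH), counts[k]))
    return table


table = frequency_table(scores)
for (lower, upper), f in table:
    print(f"{lower}-{upper}  {f}")
print("Total", sum(f for _, f in table))
```

Because the class boundaries end in .5 and the scores are whole numbers, no score can fall on a boundary, so each score belongs to exactly one class.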

Exercise.  Make  a  list  of  several  facts  that  you  can  immediately 
discover  from  Table  2.  Answer  the  questions  that  we  have  listed  on  the 
preceding  page. 

It  must  not  be  supposed  that  the  answers  to  the  above  questions 
constitute  the  analysis  of  the  data.  The  analysis  is  contained  in 
the  following  constants  that  we  shall  later  learn  to  compute  and 
interpret. 

M = 30.7        σ = 7.85

Md = 30.71      Q₁ = 25.3

Mo = 30.5       Q₃ = 36.25

each  expressed  in  the  given  unit  of  measure. 
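
As a rough check on some of these constants, the sketch below recomputes M, Md, Q₁, and Q₃ from the frequencies of Table 2. It rests on assumptions of ours, since the methods are developed only in later chapters: every score is placed at the midpoint of its class for the mean, and scores are taken to be evenly spread within a class for the median and quartiles. The function names are our own, and σ and Mo are not attempted here.

```python
# Table 2 as (lower class boundary, frequency) pairs, lowest class first.
CLASS_WIDTH = 5
classes = [(12.5, 6), (17.5, 10), (22.5, 16), (27.5, 28),
           (32.5, 20), (37.5, 12), (42.5, 8)]
N = sum(f for _, f in classes)


def grouped_mean(classes):
    """Mean with every score placed at the midpoint of its class (assumption)."""
    return sum((lower + CLASS_WIDTH / 2) * f for lower, f in classes) / N


def grouped_quantile(classes, p):
    """Value below which a fraction p of the scores lie.

    Assumes the scores are spread evenly within each class, so the
    position inside the class is found by linear interpolation.
    """
    target = p * N
    below = 0  # number of scores in the classes already passed
    for lower, f in classes:
        if below + f >= target:
            return lower + CLASS_WIDTH * (target - below) / f
        below += f
    raise ValueError("p must lie between 0 and 1")


M = grouped_mean(classes)             # 30.7
Md = grouped_quantile(classes, 0.50)  # 30.71 to two decimals
Q1 = grouped_quantile(classes, 0.25)  # 25.3125
Q3 = grouped_quantile(classes, 0.75)  # 36.25
print(M, round(Md, 2), Q1, Q3)
```

Under these assumptions the four values agree with those listed above to the number of places given.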

To  undertake  an  interpretation  at  this  time  would  take  us  too  far 
afield. 

2.  MATHEMATICAL  AND  NON-MATHEMATICAL 

ASPECTS  OF  STATISTICS 

As  has  been  indicated  in  our  definition,  the  steps  involved  in  the 
solution of a statistical problem may be summarized as follows:

1. The collection of the data

2.  The  organization  of  the  data 

3.  The  analysis  of  the  data 

4.  The  interpretation  and  criticism  of  the  results 

The collection of the data and their organization are largely non-mathematical
operations. In regard to this Bowley says, "Common
sense is the chief requisite and experience the chief teacher."¹
However,  we  shall  refer  to  these  items  in  later  chapters.  The  ele- 
mentary analysis  of  the  data  involves  in  general  no  so-called  higher 
mathematics.  It  is  well  to  understand  algebraic  averages  and  some 
of  the  elementary  principles  of  the  algebra  of  summations.  These 
will  be  considered  in  another  section  of  this  chapter.  An  under- 
standing of  the  calculus  is  always  helpful  and  at  certain  times  highly 
desirable,  but  this  preparation  is  not  necessary  for  the  elementary 
course.  The  student  who  desires  a  knowledge  of  the  more  refined 
methods  of  statistical  analysis  will  find  an  understanding  of  the 
calculus and the theory of probability indispensable.² For the
interpretation  and  the  criticism  of  the  results,  one  cannot  know 
too  much.   Bowley  says  in  this  regard: 

For  criticism  of  estimates  and  interpretation  of  results  it  is  necessary 
to  use  formulae  of  more  advanced  mathematics,  and  it  is  obviously  ex- 
pedient to  understand  the  methods  by  which  these  formulae  are  obtained 
to ensure their intelligent use.³

Since  this  book  is  essentially  one  of  the  methods  of  elementary 
analysis  of  statistical  data,  all  technical  questions  that  require  a 
considerable  knowledge  of  advanced  mathematics  will  be  omitted. 

3.  VARIABLES  AND  FUNCTIONS 

A common property of any character with which statistics is concerned
is that of variation or change. The grades of a class in
geometry, the scores in an examination, the heights of a group, the
number of petals on a group of buttercups, the production of wheat
from year to year, even a group of measurements of the length of a
room: all these show variation. In statistics the magnitudes of
the character measured are frequently called variates. A variate,
then, is a very special and specific use of the broader term variable,
which signifies any quantity that changes in magnitude.

¹ A. L. Bowley, Elements of Statistics, p. 14.

² E. V. Huntington, "Mathematics and Statistics," American Mathematical
Monthly, December, 1919.

³ Bowley, op. cit., p. 14.


Table 3. Grades of a Class in Algebra

Grade      Frequency f, or the number of
X          students receiving the grade, f(X)

65           3
75          14
85          10
95           3

(Total)     30

In Table 3 there are two variables, the grades, X, and the
frequencies, f(X), but the variates are the magnitudes of the
grades.

We  shall  find  it  necessary  and  convenient  to  recognize  two  distinct 
classes  of  variates,  continuous,  and  discrete  or  discontinuous. 

A  continuous  variate  is  one  whose  magnitudes  may  differ  by 
infinitesimal  amounts  between  certain  limits:  for  example,  the 
weight  of  a  man,  the  temperature  of  a  place,  the  length  of  a  bean  pod, 
the  height  of  a  plant. 

A  discrete  or  discontinuous  variate  is  one  whose  value  must  be 
described  in  integers;  for  example,  the  number  of  pupils  in  a  class, 
the  number  of  kernels  on  an  ear  of  corn,  the  number  of  seeds  in  an 
apple,  the  number  of  culms  on  an  oat  plant.  A  discrete  variate  is 
sometimes  called  an  integral  variate. 

As  in  other  fields  of  mathematics,  it  will  be  convenient  to  recall 
another  classification  of  variables,  namely,  the  independent  and  the 
dependent.  The  independent  variable  is  the  one  to  which  we  assign 
values  at  pleasure,  whereas  the  dependent  variable  is  the  one  whose 
value  depends  upon  that  assigned  to  the  independent  variable. 

In  Table  3  the  frequency  of  a  given  grade  depends  upon  the  grade 
we  have  assigned  at  pleasure.  The  frequency  is,  therefore,  the 
dependent  variable.  The  dependent  variable  is  frequently  called  a 
function  of  the  independent  variable  or  argument.  The  independent 
variables will be represented in this text by X, x', x, or t; the functional,
or dependent variable by y, f(x), or f(t), etc. We shall say
that y or f(x) is a function of x if y is dependent upon x and if to every
value of x there corresponds a value of y or f(x). It will not be necessary
to describe this correspondence by means of an equation.
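
A correspondence given by a table rather than by an equation can be sketched as follows, using the grades and frequencies of Table 3. The names `frequency`, `f`, and `total_students` are ours and serve only as an illustration.

```python
# Table 3 as a correspondence: each grade X is paired with its frequency f(X).
frequency = {65: 3, 75: 14, 85: 10, 95: 3}


def f(grade):
    """Dependent variable: the number of students receiving the given grade."""
    return frequency[grade]


# X is the independent variable; f(X) is determined once X is chosen.
for X in sorted(frequency):
    print(X, f(X))

total_students = sum(f(X) for X in frequency)
print("Total", total_students)
```

No formula connects X and f(X) here; the table itself is the function.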

4.  SUMS  AND  SUMMATIONS 

Statistics  may  be  roughly  defined  as  the  study  of  averages.  Since 
nearly  all  averages  involve  the  evaluation  of  certain  sums,  it  will 
be  well  at  this  point  to  acquaint  ourselves  with  an  abbreviated 
notation  for  sums  and  develop  some  useful  formulas  that  will  aid 
us  in  quickly  evaluating  later  sums.  We  shall  discover  that  a  facility 
with  this  new  notation  will  quicken  our  understanding  of  the  later 
chapters. 

In elementary algebra if we add a set of letters $x_1, x_2, \ldots, x_n$ we
indicate the sum by:

\[ x_1 + x_2 + x_3 + \cdots + x_n \]

In more advanced mathematics we would designate this sum compactly
as $\sum_{i=1}^{n} x_i$. Thus:

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n \]

This should be read "sigma of x sub i (or summation of x sub i)
when i assumes all integral values from 1 to n inclusive."

The Greek capital letter Σ (sigma) placed before a term signifies
the sum of all terms of which that term is the general type.

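Thus:

\[ 1^2 + 2^2 + 3^2 + \cdots + n^2 = \sum_{1}^{n} x^2 \]

\[ \frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} = \sum_{1}^{n} \frac{1}{x} \]

\[ 2^3 + 3^3 + 4^3 + 5^3 + 6^3 = \sum_{2}^{6} x^3 \]

\[ \log 5 + \log 7 + \log 9 + \cdots + \log (2n - 3) = \sum_{4}^{n} \log (2x - 3) \]

and in general

\[ f(1) + f(2) + f(3) + \cdots + f(n) = \sum_{1}^{n} f(x) \tag{1} \]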
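
The Σ notation translates almost word for word into a short program. The sketch below is ours and makes two assumptions: the helper `sigma` is a name invented for illustration, and "log" is taken to mean the common (base 10) logarithm.

```python
import math


def sigma(term, lower, upper):
    """Sum term(x) as x takes every integral value from lower to upper inclusive."""
    return sum(term(x) for x in range(lower, upper + 1))


n = 10

# 1^2 + 2^2 + 3^2 + ... + n^2
sum_of_squares = sigma(lambda x: x ** 2, 1, n)

# 2^3 + 3^3 + 4^3 + 5^3 + 6^3
sum_of_cubes = sigma(lambda x: x ** 3, 2, 6)

# log 5 + log 7 + log 9 + ... + log(2n - 3), with x running from 4 to n
sum_of_logs = sigma(lambda x: math.log10(2 * x - 3), 4, n)

print(sum_of_squares)  # 385
print(sum_of_cubes)    # 440
print(sum_of_logs)
```

The function passed as `term` plays the part of the general term f(x), and the two integers play the part of the limits written below and above the Σ.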

The values below and above the summation symbol Σ, which are
the initial and final values of the independent variable, are the
limits of the summation. The limits of the summation may have
any values and the independent variable may change by other
amounts than unity. Slight changes in the notation make these
extensions possible. Thus:

\[ x_5 + x_{10} + x_{15} + \cdots + x_{100} = \sum_{i=5,10,\ldots}^{100} x_i = \sum_{i=1}^{20} x_{5i} \]

\[ f(65) + f(75) + f(85) + f(95) = \sum_{65,75,\ldots}^{95} f(x) \]

\[ 5f(5) + 10f(10) + 15f(15) + \cdots + 75f(75) = \sum_{1}^{15} 5x\,f(5x) \]

\[ f(a) + f(a + b) + f(a + 2b) + \cdots + f(a + nb) = \sum_{x=0}^{x=n} f(a + xb) \]

If the quantity under the summation Σ does not contain a variable,
all the terms are equal. As examples we have:

\[ \sum 1 = 1 + 1 + 1 + 1 + \cdots + 1 = N \]

\[ \sum c = c + c + c + c + \cdots + c = Nc \]

where  N  is  the  number  of  observations  or  measurements. 
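
These extensions can be checked numerically. In the sketch below the measurements `x` and the function `f` are hypothetical, invented by us purely for illustration; only the patterns of the limits and steps are taken from the text.

```python
# Hypothetical measurements x_1, ..., x_100 (made up for illustration only).
x = {i: (7 * i) % 13 for i in range(1, 101)}

# x_5 + x_10 + x_15 + ... + x_100, letting the index advance by 5 ...
stepped = sum(x[i] for i in range(5, 101, 5))
# ... and the same sum with i running from 1 to 20 and the subscript 5i.
rescaled = sum(x[5 * i] for i in range(1, 21))


def f(value):
    """A hypothetical function; any function would serve here."""
    return value * value + 1


# 5 f(5) + 10 f(10) + ... + 75 f(75), written term by term ...
weighted_direct = sum(v * f(v) for v in range(5, 76, 5))
# ... and as the sum of 5x f(5x) for x = 1, 2, ..., 15.
weighted_sigma = sum(5 * t * f(5 * t) for t in range(1, 16))

# A summand that contains no variable gives N equal terms.
N, c = 12, 3.5
sum_of_ones = sum(1 for _ in range(N))  # equals N
sum_of_c = sum(c for _ in range(N))     # equals N * c

print(stepped == rescaled, weighted_direct == weighted_sigma)
print(sum_of_ones, sum_of_c)
```

The two ways of writing each sum differ only in how the index is counted, so they must agree whatever values are chosen for the measurements.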

It  frequently  happens  that  there  is  no  necessity  for  writing  the 
independent  variable  or  the  limits  of  the  summation  below  and  above 
the  summation  symbol.  When  the  context  tells  us  what  is  meant, 
we  shall  resort  to  this  plan.  Thus,  if  the  lower  or  upper  limit  is 
omitted,  it  is  assumed  to  be  1  or  n  respectively. 

EXERCISES 

Write  the  series  that  are  represented  by  the  following  symbols: 

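1. $\sum^{n} \frac{1}{x}$    2. $\sum^{n} 2^x$    3. $\sum^{n} (x - 3)$    4. $\sum^{10} {}_{10}C_x$

5. $\sum_{x=1}^{n} x a^x$    6. $\sum_{x=1}^{10} x \, {}_{10}C_x$    7. $\sum_{10,20,\ldots}^{100} x f(x)$    8. $\sum_{5,10,\ldots}^{80} x^2 f(x)$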

Write in the abbreviated form using Σ.

9. $1 \cdot 2 + 2 \cdot 3 + 3 \cdot 4 + \cdots + n(n + 1)$
10. $(X_1 - M)^2 + (X_2 - M)^2 + (X_3 - M)^2 + \cdots + (X_n - M)^2$

We  will  now  consider  two  useful  and  important  theorems. 

Theorem I. The Σ (sigma) of an algebraic sum of several functions
is equal to the algebraic sum of the sigmas of the several functions.
Symbolically, we state that:

\[ \sum [f(x) \pm F(x) \pm w(x) \pm \text{etc.}] = \sum f(x) \pm \sum F(x) \pm \sum w(x) \pm \text{etc.} \]

Theorem II. The Σ (sigma) of a constant times a function is equal
to the constant times the sigma of the function. Symbolically, we state
that:

\[ \sum c f(x) = c \sum f(x) \]

This means of course that the constant factor, c, may be placed to the
left or to the right of the Σ at pleasure.

Proof of Theorem I (for two functions)

By the definition (1):

\[
\begin{aligned}
\sum [f(x) \pm F(x)] &= f(1) \pm F(1) + f(2) \pm F(2) + \cdots + f(n) \pm F(n) \\
&= [f(1) + f(2) + \cdots + f(n)] \pm [F(1) + F(2) + \cdots + F(n)] \\
&= \sum f(x) \pm \sum F(x)
\end{aligned}
\]

The proof is easily extended to any number of functions.

The proof of Theorem II will be left as an exercise for the student.

Example. From the identity

\[ x^2 - (x - 1)^2 = 2x - 1 \]

we shall prove:

\[ \sum_{1}^{n} x = \frac{n(n + 1)}{2} \]

From the preceding definitions and theorems we have

\[ \sum [x^2 - (x - 1)^2] = 2 \sum x - \sum 1 = 2 \sum x - n \]

and this means, by (1):

\[ (1^2 - 0^2) + (2^2 - 1^2) + (3^2 - 2^2) + \cdots + [n^2 - (n - 1)^2] = 2 \sum x - n \]

or

\[ n^2 = 2 \sum x - n \]

Hence, we obtain:

\[ \sum x = \frac{n(n + 1)}{2} \]
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EXERCISES 


State  in  words. 


3.  Prove: 


n 


2(22:  -  1)  =  n\ 


X 


4.  Apply  the  result  of  Number  3  to  find  the  sums: 


44  5 


(1)  11  +  13  +  15  +  .  •  •  +  87  =  S(2a:  -  1)  -  S(2a: 


1) 


1  1 


(2)  127  +  129  +  131  +  •  •  •  +  195 
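
The two theorems and the sums discussed above can be checked numerically. The following Python sketch is only an illustration: the helper name sigma, the functions f and F, the constant c, and the limit n = 25 are our own arbitrary choices and do not come from the text.

```python
def sigma(func, n, start=1):
    """Sum func(x) for x = start, start + 1, ..., n."""
    return sum(func(x) for x in range(start, n + 1))

n = 25                      # arbitrary upper limit chosen for the check
f = lambda x: x * x         # arbitrary first function
F = lambda x: 3 * x + 1     # arbitrary second function
c = 7                       # arbitrary constant

# Theorem I: the sigma of a difference equals the difference of the sigmas.
theorem_1 = sigma(lambda x: f(x) - F(x), n) == sigma(f, n) - sigma(F, n)

# Theorem II: a constant factor may be moved outside the sigma.
theorem_2 = sigma(lambda x: c * f(x), n) == c * sigma(f, n)

# Sum of the first n integers, and sum of the first n odd numbers.
sum_of_integers = sigma(lambda x: x, n) == n * (n + 1) // 2
sum_of_odds = sigma(lambda x: 2 * x - 1, n) == n ** 2

# 11 + 13 + ... + 87 written as a difference of two sums of odd numbers.
direct = sum(range(11, 88, 2))
by_formula = 44 ** 2 - 5 ** 2

print(theorem_1, theorem_2, sum_of_integers, sum_of_odds, direct == by_formula)
```

A check of this kind does not replace the proofs; it only confirms the identities for the particular values chosen.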

The above definitions, theorems, and exercises are concerned with
the abstract algebra of summation. The numbers that have appeared
as illustrations enjoy a degree of regularity not found in observed
measurements. Thus, such series as: 1, 2, 3, ..., 25;
\( 1^2, 2^2, 3^2, \ldots, 16^2 \) are not often found in actual
measurements. So this abstract algebra is apparently not valuable in
dealing with observed data. Actually, we shall find this abstract
algebra of summation very helpful in developing statistical theory and
frequently helpful in dealing with real measurements.

The numbers we meet in numerical problems are real measurements
or scores that come from actual observation and they do not
generally proceed with regularity. We should understand thoroughly
how the algebra of summation applies to such measurements. Ten
men in a class in Statistical Analysis gave their weights: 128, 131,
137, 143, 144, 146, 147, 149, 155, 170 pounds. Obviously these
numbers do not proceed from the small to large values with the
regularity of the numbers: \( 1^2, 2^2, 3^2, \ldots, 12^2 \). However, we
can apply our \( \Sigma \) notation to such irregular series as the weight data.

Let us arrange our data as in the adjacent vertical columns where
we use the upper case X to indicate a measurement of weight. The
first measurement we indicate by X_1, the second by X_2, and so on.
We are then able to express the sum of the weights by the summation
sign, \( \Sigma \). Out of curiosity we found the average weight of
the ten men.

            Weight
    X_1  =   128
    X_2  =   131
    X_3  =   137
    X_4  =   143
    X_5  =   144
    X_6  =   146
    X_7  =   147
    X_8  =   149
    X_9  =   155
    X_10 =   170

    \( \sum_{i=1}^{10} X_i = 1450 \)

    Average weight = 1450/10 = 145 lbs.

In this case we note that it is the subscript i that varies, and with
the weight of each man is associated a subscript. It is generally not
necessary to indicate so precisely the indexes of the summation. It
suffices to know that X refers to the characteristic measured, weight
of a man, and \( \sum X \) represents the sum of the weights of the men.
Consequently we could write

    Sum of the weights = \( \sum X \) = 1450 pounds

and not be misunderstood.

To save labor in statistical computations we find it convenient to
effect simple transformations upon the variates. Thus, for the weight
data we can work with smaller numbers if we refer our weights to some
conveniently chosen number, say 100, instead of to 0. Since X
represents the measurements referred to zero as origin, we must choose
a new letter, say U, to represent the measurements referred to 100 as
origin. We pose the question: Can we find \( \sum X \) by finding and
using \( \sum U \)?

The first weight whose X is 128 will have a U of 28; the second
weight has X = 131 and U = 31; and so on. Obviously we have

\[ U = X - 100 \]

for each of the measurements. The detail is shown in Table 4.

Table 4. Weights of 10 Men

    Weight X        U = X - 100
    X_1  = 128      U_1  = 28 = X_1  - 100
    X_2  = 131      U_2  = 31 = X_2  - 100
    X_3  = 137      U_3  = 37 = X_3  - 100
    X_4  = 143      U_4  = 43 = X_4  - 100
    X_5  = 144      U_5  = 44 = X_5  - 100
    X_6  = 146      U_6  = 46 = X_6  - 100
    X_7  = 147      U_7  = 47 = X_7  - 100
    X_8  = 149      U_8  = 49 = X_8  - 100
    X_9  = 155      U_9  = 55 = X_9  - 100
    X_10 = 170      U_10 = 70 = X_10 - 100

    Adding,          \( \sum U = 450 = \sum X - 10(100) \)
                     \( \sum X = 10(100) + 450 \)
    Average weight   \( \frac{\sum X}{10} = 100 + 45 = 145 \) lbs.

In practice we do not go into such detail. We abbreviate our work
a great deal, as is indicated in Table 5. We follow a few systematic
steps:

(1) We decide upon the transformation we wish to use. For the
weight data we use \( U = X - 100 \).

(2) We complete the table to agree with the chosen transformation.
That is, we find the U that corresponds to a given X.

(3) We derive a formula to agree with the chosen transformation.
Thus, when \( U = X - 100 \), we have

\[ X = U + 100 \]
\[ \sum X = \sum (U + 100) = \sum U + \sum 100 \]
\[ \sum X = \sum U + 10(100) \]

since N, the number of measurements, is 10.

(4) We substitute the values found from the table, step (2), into
the formula we derive in step (3).
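
The four steps can be sketched in Python as follows. The function name mean_by_shift and its parameter names are our own illustrative choices; the data are the ten weights given earlier and the origin 100 is the one used in the text.

```python
weights = [128, 131, 137, 143, 144, 146, 147, 149, 155, 170]
origin = 100  # step (1): the transformation is U = X - origin

def mean_by_shift(values, origin):
    """Find the sum and the average of X by way of U = X - origin."""
    n = len(values)
    u = [x - origin for x in values]      # step (2): complete the U column
    sum_u = sum(u)
    sum_x = sum_u + n * origin            # step (3): sum X = sum U + N(origin)
    return sum_u, sum_x, sum_x / n        # step (4): substitute the values

sum_u, sum_x, average = mean_by_shift(weights, origin)
print(sum_u, sum_x, average)  # 450 1450 145.0
```

Any other convenient origin would serve equally well; only the size of the numbers in the U column changes, not the final sum.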


Table 5. Weights of 10 Men

    Weight X      U = X - 100
      128             28
      131             31
      137             37
      143             43
      144             44
      146             46
      147             47
      149             49
      155             55
      170             70
                     450 = \( \sum U \)

Solution

\[ U = X - 100 \]
\[ X = U + 100 \]
\[ \sum X = \sum U + N(100) \]
\[ \frac{\sum X}{N} = \frac{\sum U}{N} + 100 \]
\[ \text{Average weight} = \frac{450}{10} + 100 = 145 \text{ pounds} \]

We now submit two more illustrations that involve simple
transformations. In each case our problem is to find \( \sum X \) by
using \( \sum U \).

First illustration:

      X       U = X/125
     125          1
     250          2
     375          3
     500          4
     625          5
                 15 = \( \sum U \)

\[ U = \frac{X}{125} \quad \text{or} \quad X = 125U \]
\[ \sum X = \sum 125U = 125 \sum U \]
\[ \sum X = 125(15) = 1875 \]

Second illustration:

      X       U = (X - 384)/128
     128         -2
     256         -1
     384          0
     512          1
     640          2
     768          3
                  3 = \( \sum U \)

\[ U = \frac{X - 384}{128} \quad \text{or} \quad X = 128U + 384 \]
\[ \sum X = \sum (128U + 384) = \sum 128U + \sum 384 \]
\[ \sum X = 128 \sum U + N(384) \]
\[ \sum X = 128(3) + 6(384) = 2688 \]
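
Both illustrations are instances of one transformation, U = (X - A)/c, for which the sum of X equals c times the sum of U plus N times A. A minimal Python sketch follows; the function name and the parameter names origin (for A) and unit (for c) are assumptions made here for illustration only.

```python
from fractions import Fraction

def sum_via_linear_coding(values, origin, unit):
    """Return (sum of U, sum of X) where U = (X - origin) / unit."""
    n = len(values)
    # Fractions keep the arithmetic exact even if a U is not a whole number.
    u = [Fraction(x - origin, unit) for x in values]
    sum_u = sum(u)
    sum_x = unit * sum_u + n * origin     # sum X = c * sum U + N * A
    return sum_u, sum_x

first = sum_via_linear_coding([125, 250, 375, 500, 625], origin=0, unit=125)
second = sum_via_linear_coding([128, 256, 384, 512, 640, 768], origin=384, unit=128)
print(first, second)
```

With origin 0 the formula reduces to the first illustration, and with unit 1 it reduces to the shift of origin used for the weight data.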


EXERCISES

1. Complete the following tables and find the values of the quantities
suggested.

                a                                  b

      X     x = X - 81     x^2              X     U = X - 50
      85                                    72
      94                                    75
      73                                    60
      66                                    53
      87                                    75
      95                                    92
      80                                    73
      72                                    60
      95                                    85
      63                                    55

a.  \( \sum X = 810 \)          Av. = 81
    \( \sum x = (\ ) \)         \( (\sum x)^2 = (\ ) \)
    \( \sum x^2 = (\ ) \)       \( \sqrt{\sum x^2} = (\ ) \)

b.  \( \sum U = (\ ) \)         \( \sum U^2 = (\ ) \)
    Show that \( \sum X = 10(50) + \sum U \)
    \( \sum X = (\ ) \)
    Can you find \( \sum X^2 \) by using \( \sum U \) and \( \sum U^2 \)?
    \( \sum X^2 = (\ ) \)

      X     Y     XY     Y^2            X     Y     x = X - 6     y = Y - 7     x^2
      2    11                           2    11
      4     8                           4     8
      6     7                           6     7
      8     5                           8     5
     10     4                          10     4

Sum      30    35
Average   6     7

    \( \sum X^2 = (\ ) \)       \( \sum x = (\ ) \)         \( \sum y = (\ ) \)
    \( \sum Y^2 = (\ ) \)       \( \sum x^2 = (\ ) \)       \( \sum xy = (\ ) \)
    \( \sum XY = (\ ) \)        \( \sum y^2 = (\ ) \)


5.   REMARKS  ON  MEASUREMENT 

It  is  almost  a  commonplace  that  nearly  all  the  numerical  data 
used  by  the  statistician  are  necessarily  approximations,  true  usually 
to  two,  three,  or  more  figures.  The  observer  who  collects  the  original 
data  and  the  statistician  who  undertakes  to  analyze  and  interpret 
them  are  frequently  different  individuals.  The  statistician  must 
accept  the  measurements  that  are  given  him  and  should  seek  to 
obtain  results  that  are  consistent  with  the  data. 

The degree of approximation of a measurement depends upon the
skill and the carefulness of the operator and upon the kind of
instrument used. Of course it may happen that a measurement is exactly
the true value of the quantity measured, but the measurer can never
know when this is so. Since statistics has to do primarily with
observed measurements, which are admittedly approximations, and
with processes that are also approximative, it is obvious that any
numerical result computed from them will in like manner be an
approximation.

6.  DECIMAL  ACCURACY 

The average student may have some difficulty in grasping the idea
that accuracy is a relative matter and absolute precision of
measurement an impossibility. He is accustomed to think of 9.7 as
meaning the same thing as 9.70 and even 9.70000000 ... to an unlimited
number of decimal places.

If 9.7 does not mean the ideal number 9.7000000 ..., what does
it mean? For the sake of clarity of understanding and precision of
statement, the scientist has adopted the convention that "9.7"
means between 9.65 and 9.75. If we record the length of a line as
18 inches we mean that it lies between 17.5 inches and 18.5 inches.
When we say that the distance to the moon is 240,000 miles, we
mean that that distance is between 235,000 and 245,000 miles.

A measurement recorded as 18 inches means that the measurement
is correct to the nearest inch or to units. A measurement* recorded
as 9.7 cm. means that the number is correct to the nearest tenth of a
centimeter. The number is sometimes written 9.7 ± 0.05, in which
the expression 0.05 should be read "with a possible error of 0.05."

Similarly, a recorded value of 9.70 would mean from 9.695 to 9.705
and might be written 9.70 ± 0.005.

Unless  otherwise  specified,  a  score  for  a  continuous  variate  should 
be  interpreted  as  extending  from  half  a  unit  of  the  last  place  of  the 
measurement  below  to  half  a  unit  above  the  recorded  entry.  A 
similar  assumption  regarding  discrete  data  avoids  confusion  in  the 
analysis.  Hence  we  shall  assume  that  a  measurement  for  a  discrete 
variate  extends  from  half  a  unit  below  to  half  a  unit  above  the 
recorded  score. 
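
The half-unit convention can be sketched in Python with the standard decimal module, which keeps the number of recorded decimal places. The function name implied_interval is our own, and passing the measurement as text (so that "9.7" and "9.70" remain distinct) is an assumption of this sketch.

```python
from decimal import Decimal

def implied_interval(recorded):
    """Return the (lower, upper) limits implied by a recorded measurement.

    The measurement is given as text so that trailing zeros are kept;
    "2.4E5" records 240,000 with two significant figures.
    """
    value = Decimal(recorded)
    exponent = value.as_tuple().exponent        # place of the last recorded digit
    half_unit = Decimal(5).scaleb(exponent - 1)  # half a unit of that place
    return value - half_unit, value + half_unit

for text in ["9.7", "18", "9.70", "2.4E5"]:
    lower, upper = implied_interval(text)
    print(text, "means between", lower, "and", upper)
```

The half unit returned here is what the text calls the possible error of the recorded value.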

7.   SIGNIFICANT  FIGURES 

In  the  expression  9.7  cm.,  both  the  9  and  the  7  mean  something  or 
are  significant.  In  the  expression  97  mm.,  there  are  likewise  two 
significant  figures.  There  are  five  significant  figures  in  each  of  the 
numbers  203.05,  263.10,  0.0076389,  500.00,  but  only  two  in  the 
number  93,000,000  which  gives  the  approximate  number  of  miles 
from  the  earth  to  the  sun. 

When the distance from the earth to the sun is given as 93,000,000
miles, in the light of the convention that we discussed in the
preceding section, the statement might be interpreted to mean that the
distance is between 92,999,999.5 and 93,000,000.5 miles. Since the
figures 9 and 3 alone are to be regarded as significant, the exact
distance is between 92,500,000 and 93,500,000 miles. This confusion can
be prevented by writing the number in the standard form
\( 9.3 \times 10^7 \), the number of significant figures being indicated
by the factor at the left which has one figure before the decimal point.

* If a measurement may be written a ± e, we call e the possible error
in a, the measurement.

We  determine  the  significant  digits  in  a  number  by  reading  the 
number  from  left  to  right,  commencing  with  the  first  digit  not  zero 
and  ending  with  the  last  digit  accurately  specified.  The  position  of 
the  decimal  point  has  no  influence  on  the  number  of  significant  digits. 

Thus  34  has  two  significant  figures;  7.3,  two;  406,  three;  7,003, 
four;  8.0,  two;  0.40,  two;  9.00,  three;  0.006,  one;  0.0050,  two; 
and  2.4  X  10«,  two. 
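
The counting rule can be sketched in Python as below. The function name is our own, and one point is an assumption of this sketch rather than a statement of the text: for a number written without a decimal point, such as 93,000,000, the trailing zeros are treated as place holders and are not counted.

```python
def significant_figures(text):
    """Count significant figures in a number written as plain text."""
    s = text.replace(",", "").lstrip("+-")
    has_point = "." in s
    # Read from the first digit that is not zero.
    digits = s.replace(".", "").lstrip("0")
    # Assumption: without a decimal point, trailing zeros only fix the
    # position of the digits and are not accurately specified.
    if not has_point:
        digits = digits.rstrip("0")
    return len(digits)

for text in ["34", "7.3", "406", "7,003", "8.0", "0.40", "9.00", "0.006", "0.0050"]:
    print(text, significant_figures(text))
```

A number in standard form is handled by applying the function to the factor at the left, for example "9.3" for the distance to the sun.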

8.   ROUNDING  OFF  NUMBERS 

Sometimes  we  are  furnished  with  numbers  recording  measurements 
that  are  given  with  a  greater  accuracy  than  we  can  use,  or  care  to  use. 
We  accordingly  round  them  off  to  the  accuracy  desired. 

A number is rounded off by dropping one or more digits at the
right.  When  the  digit  dropped  is  5  or  more,  increase  the  preceding 
digit  by  unity;  when  it  is  less  than  5,  retain  the  preceding  digit 
unchanged. 

The  following  numbers  are  rounded  according  to  the  above  rule: 


Numbers  Rounded  Values 

4.5647  4.565;  4.57;  etc. 

0.49781  0.498;  0.50;  etc. 
17.65  17.7 
17.75  17.8 
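
The rule can be sketched in Python with the decimal module and its ROUND_HALF_UP mode, which raises the preceding digit whenever the dropped part begins with 5 or more. The function name round_off is our own. Note that the sketch rounds in a single step: 4.5647 rounded directly to two places gives 4.56, whereas the value 4.57 in the table is reached by rounding the already rounded 4.565.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_off(number, places):
    """Round a number, given as text, to the stated number of decimal places."""
    quantum = Decimal(1).scaleb(-places)   # e.g. 0.01 for two places
    return Decimal(number).quantize(quantum, rounding=ROUND_HALF_UP)

print(round_off("4.5647", 3))    # 4.565
print(round_off("4.565", 2))     # 4.57
print(round_off("0.49781", 3))   # 0.498
print(round_off("0.49781", 2))   # 0.50
print(round_off("17.65", 1))     # 17.7
print(round_off("17.75", 1))     # 17.8
```

The numbers are passed as text because the built-in round applied to binary floating-point values follows a different convention for a dropped 5 and would not reproduce the rule stated above in every case.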


9.   ERRORS  IN  CALCULATIONS 

As  magnitudes  determined  by  measurement  are  not  exact,  it  is 
important  to  make  clear  the  meaning  of  the  term  error  as  it  is  used  in 
statistics. 

In  the  first  place,  errors  are  not  necessarily  what  we  usually  think 
of  as  mistakes  or  blunders.  The  latter  arise  from  carelessness  or 
incompetency  in  transcribing  figures  or  reading  values  from  a  scale. 
An absolute error in observation is the difference between a given
measurement and the true value of the quantity measured. Therefore,
an error means a deviation, a difference, but not a mistake.

The relative error in a measurement is the ratio of the absolute error
to the true value of the quantity. It may be closely approximated
by finding the ratio of the possible error to the given measurement.
It is usually expressed as a percentage. Thus if a measurement of
height is given as 68.5 inches, there is a possible error of 0.05 inches
and an approximate relative error of 0.05/68.5 = 0.0007, which equals
0.07 per cent. If a physician reports the weight of a man as 163
pounds with a possible error of 0.5 pound, the approximate relative
error may be written as 0.5/163 = 0.003 = 0.3 per cent.
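
The two computations just given can be sketched in Python. The function name is our own choice; the figures are those of the height and weight examples above.

```python
def approximate_relative_error(measurement, possible_error):
    """Ratio of the possible error to the recorded measurement."""
    return possible_error / measurement

height = approximate_relative_error(68.5, 0.05)
weight = approximate_relative_error(163, 0.5)

# Print each ratio as a decimal fraction and as a percentage.
print(f"height: {height:.4f} = {100 * height:.2f} per cent")
print(f"weight: {weight:.3f} = {100 * weight:.1f} per cent")
```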

The relative errors in the two distances \( 9.3 \times 10^7 \) and
\( 9.30 \times 10^7 \) are approximately:

\[ \frac{0.05}{9.3} = 0.005 = 0.5\% \]

and

\[ \frac{0.005}{9.30} = 0.0005 = 0.05\% \]

This illustrates the fact that the relative error depends upon the
number of significant figures in, and not upon the position of the
decimal point in, a recorded measurement.

EXERCISES

1. How many significant figures are in the following numbers?

(1) 2.375      (2) 0.0347      (3) 0.0030      (4) 5.03(10^)       (5) 5.6300(10^)

2. What is the rule to be observed when rounding off numbers?

3. A line is measured and its length is recorded as 118.63 feet. What
does this statement mean? What is the approximate relative error in
the measurement?

4. A line is measured and its length is recorded as 125.65 feet. What
does this statement mean? What is the approximate relative error in the
measurement?

5. The population of a city is given as 2.5 million. What is the
approximate percentage error?

6. The population of a city is given as 340 thousand. What is the
approximate percentage error?

7. The value of \( \pi \) correct to five significant figures is 3.1416.
Determine the percentage error when \( \pi \) is approximated by
\( 3\tfrac{1}{7} \).

8. The value of all mineral production in continental United States
in 1929, correct to the nearest million dollars, was $5,165,000,000. Write
this value in the standard form. Find the approximate percentage error
in the given estimated value.

9. Prove:

\[ \sum_{x=1}^{n} 2x = n(n + 1) \]

10. Use the result of Number 9 to find the sums:

(1) \( 30 + 32 + 34 + \cdots + 96 \).

(2) \( 128 + 130 + 132 + \cdots + 164 \).

11. Find the sum of the following numbers correct to two decimal
places: 2.4286, 12.673, 127.87, 35.583:

(1)  By  retaining  the  significant  figures  of  the  numbers  and  rounding  off 
the  sum  to  two  places; 

(2)  By  rounding  off  each  number  to  two  places  and  finding  the  sum. 

This  exercise  illustrates  the  rule:  "When  several  approximate  numbers 
are  to  be  added,  it  is  best  to  round  them  at  once  to  the  number  of  decimal 
places  in  the  least  accurate  measurement." 

12.  Find  the  sum  of  the  following  numbers  correct  to  two  decimal  places: 
3.4285,  16.743,  253.78,  36.583: 

(1)  By  retaining  the  significant  figures  of  the  numbers  and  rounding  off 
the  sum  to  two  places; 

(2)  By  rounding  off  to  two  places  each  number  and  finding  the  sum. 

10.   THE  PROPAGATION  OF  ERRORS 

In general, statistical computations are more concerned with
relative than with absolute errors. We shall include here the more
important theorems that relate to relative errors and expect the
reader who desires a wider knowledge to consult the splendid work
by Scarborough.¹

¹ J. B. Scarborough, Numerical Mathematical Analysis, p. 2.

Theorem I. The possible error in the sum or the difference of two
measurements is equal to the sum of the possible errors in the
individual measurements.

Suppose a and b are the readings of the two measurements and that
\( e_1 \) and \( e_2 \) are the numerical values of their errors. The
true values are therefore \( a + e_1 \) and \( b + e_2 \), where
\( e_1 \) and \( e_2 \) may be either positive or negative. The correct
value of the sum of the measurements lies between the limits:

\[ (a + e_1) + (b + e_2) = (a + b) + (e_1 + e_2) \]

and

\[ (a - e_1) + (b - e_2) = (a + b) - (e_1 + e_2) \]

Hence the possible error in the sum, \( a + b \), is \( e_1 + e_2 \).

The correct value of the difference of the measurements lies between
the limits:

\[ (a + e_1) - (b - e_2) = (a - b) + (e_1 + e_2) \]

and

\[ (a - e_1) - (b + e_2) = (a - b) - (e_1 + e_2) \]

Hence the possible error in the difference, \( a - b \), is \( e_1 + e_2 \).
Example. The sides of a rectangular field are measured to be 127' ±
0.2' and 231' ± 0.4'. Find the possible error in the sum of the two sides.

We have:

\[ a = 127, \quad b = 231, \quad e_1 = 0.2, \quad e_2 = 0.4 \]
\[ a + b = 358, \quad e_1 + e_2 = 0.6 \]

Hence the possible error is 0.6' and the true value of the sum of the two
sides is between 358 - 0.6 and 358 + 0.6 feet.
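
Theorem I and the example can be sketched in Python. The function name sum_with_error is our own; the decimal module is used only so that 0.2 + 0.4 prints exactly as 0.6.

```python
from decimal import Decimal

def sum_with_error(a, e1, b, e2):
    """Sum of two measurements, its possible error, and the implied limits."""
    total = a + b
    possible_error = e1 + e2          # Theorem I: the possible errors add
    return total, possible_error, total - possible_error, total + possible_error

total, possible_error, lower, upper = sum_with_error(
    Decimal("127"), Decimal("0.2"), Decimal("231"), Decimal("0.4")
)
print(total, possible_error, lower, upper)  # 358 0.6 357.4 358.6
```

For a difference a - b the possible error is the same sum of the two possible errors; only the first returned value would change.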

Theorem II. The relative error in the product of two measurements
is equal to the sum of the approximate relative errors of the individual
measurements.

With the same notation as above, the product will lie between:

\[ (a + e_1)(b + e_2) = ab + ae_2 + be_1 + e_1 e_2 \]

and

\[ (a - e_1)(b - e_2) = ab - ae_2 - be_1 + e_1 e_2 \]

Since \( e_1 \) and \( e_2 \) are both small when compared to the other
terms of the products, we shall ignore the term \( e_1 e_2 \). We then
have the possible error in the product to be approximately
\( ae_2 + be_1 \).

Hence the relative error in the product is approximately

\[ \frac{ae_2 + be_1}{ab} = \frac{e_1}{a} + \frac{e_2}{b} \]

which is the sum of the approximate relative errors.

Example. Find the absolute and the relative errors in the computed
area of the rectangle whose sides are 127' ± 0.2' and 231' ± 0.4'.

The possible error in the product is approximately

\[ 127(0.4) + 231(0.2) = 97 \text{ square feet} \]

and the true value of the area is somewhere between

\[ (127)(231) - 97 \quad \text{and} \quad (127)(231) + 97, \]

that is, between 29,240 and 29,434 square feet.
The relative error in the area is approximately:

\[ \frac{0.2}{127} + \frac{0.4}{231} = 0.0033 = 0.33\% \]

Theorem III. The relative error in the quotient of two measurements
is equal to the sum of the approximate relative errors of the
measurements.

The quotient will evidently lie between:

\[ \frac{a + e_1}{b - e_2} = \frac{a}{b} + \frac{ae_2 + be_1}{b(b - e_2)} \]

and

\[ \frac{a - e_1}{b + e_2} = \frac{a}{b} - \frac{ae_2 + be_1}{b(b + e_2)} \]

Since \( e_2 \) is small compared with b, we may, for purposes of
approximation, replace \( b + e_2 \) and \( b - e_2 \) by b; whence the
possible error in the quotient is approximately:

\[ \frac{ae_2 + be_1}{b^2} \]

Hence the relative error in the quotient is given approximately by

\[ \frac{ae_2 + be_1}{b^2} \div \frac{a}{b} = \frac{e_1}{a} + \frac{e_2}{b} \]

which is the sum of the approximate relative errors.
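
Theorems II and III can be sketched together in Python. The function names are our own; the inputs are the sides of the rectangle, 127 ± 0.2 and 231 ± 0.4, and, as a second assumed case, the quotient of 625 ± 0.7 by 36 ± 0.2. Both functions use the approximations of the proofs, not the exact limits.

```python
def product_error(a, e1, b, e2):
    """Product a*b with its approximate possible and relative errors."""
    possible = a * e2 + b * e1              # the small term e1*e2 is ignored
    relative = e1 / a + e2 / b              # Theorem II
    return a * b, possible, relative

def quotient_error(a, e1, b, e2):
    """Quotient a/b with its approximate possible and relative errors."""
    possible = (a * e2 + b * e1) / b ** 2   # b + e2 and b - e2 replaced by b
    relative = e1 / a + e2 / b              # Theorem III
    return a / b, possible, relative

area, area_err, area_rel = product_error(127, 0.2, 231, 0.4)
quot, quot_err, quot_rel = quotient_error(625, 0.7, 36, 0.2)

print(area, round(area_err, 1), round(area_rel, 4))
print(round(quot, 2), round(quot_err, 2), round(quot_rel, 4))
```

In both cases the relative error equals the possible error divided by the computed value, which is a convenient consistency check on hand computation.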

Example. Find the possible and the relative errors when 625 ± 0.7
is divided by 36 ± 0.2.

We have:

\[ a = 625, \quad e_1 = 0.7 \]
\[ b = 36, \quad e_2 = 0.2 \]

The possible error in the quotient is given approximately by

\[ \frac{625(0.2) + 36(0.7)}{36^2} = 0.12 \]

and the true value of the quotient will therefore lie between

\[ \frac{625}{36} - 0.12 \quad \text{and} \quad \frac{625}{36} + 0.12 \]

that is, between 17.24 and 17.48.

The relative error in the quotient is given approximately by:

\[ \frac{0.7}{625} + \frac{0.2}{36} = 0.0067 = 0.67\% \]

EXERCISES

Make each of the following computations and state the result so as to
show a measure of the error involved.

1. (125 ± 0.2) + (238 ± 0.3).

2. (215 ± 0.2)(115 ± 0.3).

3. (163 ± 0.2)/(25 ± 0.4).

4.  What  is  the  possible  error  in  the  area  of  a  rectangle  whose  length 
and  width  are  recorded  as  50.4  ft.,  and  30.6  ft.? 

5.  a.  Show  that  if  e  is  the  error  in  the  side  of  a  square  whose  recorded 
length  is  a,  then  the  error  in  the  area  is  approximately  2ae. 

b.  Show  that  the  relative  error  in  the  area  is  approximately  twice  the 
relative  error  in  the  edge. 

6.  Show  that  the  relative  error  in  the  area  of  a  circle  is  approximately 
twice  the  relative  error  of  the  radius. 

7. The distance from the earth to the sun is given as 93,000,000 ±
500,000 miles, and the thickness of a watch spring is given as 0.014 ±
0.0005 inches. Which is the more accurate measurement?

8. Show that the relative error in the volume of a cube is
approximately three times the relative error of the edge.

9. Show that the relative error in the volume of a sphere is
approximately three times the relative error of the radius.

10.  Are  statistical  data  always  approximate?  If  each  of  10  men  pays 
an  income  tax  of  $87,  is  their  total  contribution  $870  approximate? 

11. Find the value of \( \sum (12x^2 - 4x + 3) \).

12. Find the sum of \( 3 \cdot 7 + 4 \cdot 9 + 5 \cdot 11 + \cdots \) to n terms.

13. Find \( 11^2 + 12^2 + 13^2 + \cdots + 50^2 \).

14. Prove:

\[ \sum_{x=1}^{n} 2x(3x + 1) = 2n(n + 1)^2 \]

15. Use the result of Number 14 to find the sums:

(1) \( 12 \cdot 19 + 14 \cdot 22 + 16 \cdot 25 + \cdots + 32 \cdot 49 \).

(2) \( 36 \cdot 55 + 38 \cdot 58 + 40 \cdot 61 + \cdots + 80 \cdot 121 \).

16. Prove that

\[ \sum_{x=1}^{n} 2x(2x + 1) = \frac{n(n + 1)(4n + 5)}{3} \]

17. Use the result of Number 16 to find the sums:

(1) \( 12 \cdot 13 + 14 \cdot 15 + 16 \cdot 17 + \cdots + 96 \cdot 97 \).

(2) \( 48 \cdot 49 + 50 \cdot 51 + 52 \cdot 53 + \cdots + 90 \cdot 91 \).

18. Find in terms of n the value of \( \sum x(x + 1) \).

19. Find in terms of n the value of

\( 1 \cdot 3 + 2 \cdot 4 + 3 \cdot 5 + \cdots \) to n terms.

20. Find identities and prove that:

\[ \sum_{x=1}^{n} x^4 = \frac{n(n + 1)(2n + 1)(3n^2 + 3n - 1)}{30} \]


b. 


1 


100 


21. 


Find  2x\ 


50 


22.  Find  1  •  2^  +  2  •  3^  +  3  •  4^  +  •  •  •  to  n  terms. 

23.  The  estimated  value  of  anthracite  coal  produced  in  Pennsylvania 
in  1929  was  $3,935(10^)  and  the  estimated  quantity  produced  was  7.664(100 
tons.  What  was  the  estimated  value  per  ton  and  what  was  the  relative 
error  in  the  estimate? 

24.  The  estimated  production  of  tobacco  in  the  United  States  in  1929 
was  1.5(10^)  pounds  and  the  estimated  price  received  was  10.0  cents  per 
pound. What was the estimated value of the crop? What was the
percentage error in the estimated value?

25.  The  estimated  production  of  potatoes  in  the  United  States  in  1929 
was  3.57(10^)  bushels,  and  the  estimated  price  per  bushel  was  131.4  cents. 
What  was  the  estimated  value  of  the  crop?  What  is  the  relative  error  in 
the  estimated  value? 

26.  The  estimated  production  of  potatoes  in  the  United  States  in  1929 
was  3.57(10^)  bushels  and  the  estimated  acreage  was  3.37(10^).  What  was 
the  estimated  yield  per  acre?  What  is  the  relative  error  in  the  estimated 
yield? 

27.  A  teacher's  salary  of  $350  a  month  was  decreased  10  per  cent  and 
later  increased  5  per  cent.   What  is  his  present  salary? 

28.  City  A  increased  in  population  from  25,750  to  35,890  in  a  decade, 
and  City  B  increased  from  255,000  to  350,000  during  the  same  decade. 
Which  city  had  the  greater  percentage  increase? 

29. The general price level rose 80 per cent, then declined 33⅓ per cent.
How  much  was  it  then  above  its  starting  point? 

30.  A  man  whose  salary  was  actually  $275.00  a  month  was  reported 
to  be  receiving  $300.00  a  month.  What  was  the  percentage  error  in  the 
report? 

31.  A  teacher's  salary  of  $200  a  month  was  decreased  25  per  cent  and 
then  increased  25  per  cent.   What  is  his  present  salary? 

32.  The  value  of  the  exports  from  the  United  States  to  a  neighboring 
country  in  1934  was  15  per  cent  less  than  the  value  in  1933,  but  in  1935 
was  10  per  cent  greater  than  the  value  in  1934.  Compare  the  value  in 
1935  with  that  in  1933. 

33.  The  number  of  registered  passenger  automobiles  in  the  United 
States  in  1929  was  2.3122(10^  and  the  estimated  population  the  same 
year  was  1.22(10^).  What  was  the  estimated  number  of  people  for  each 
passenger automobile? What was the relative error in the estimate?

TABULAR AND GRAPHICAL REPRESENTATION:
FREQUENCY DISTRIBUTIONS

11. INTRODUCTION

Almost without exception the object of a statistical analysis is to
form a judgment of a very large universe by means of a study of a
small part of it. The large universe we call the parent population,¹ and
the part of it that we use as a basis for generalization we call a
sample.

In some cases it is impossible to measure the entire parent
population and in other cases it is impracticable to do so. Suppose a
physician was interested in the blood pressure of American men
between thirty and thirty-one years of age. He could never expect
to get complete data for all the men in the parent population. Not
only would it be impossible; it would be unnecessary, expensive,
and a waste of time and energy. An excellent judgment could be
made by the study of a properly selected sample.²

Our first task therefore is to secure the data of a properly selected
sample, and then proceed to the analysis. The analysis will give us a
summarized numerical description of the sample from which we may,
if we desire, form certain judgments of the parent population.

¹ Statistically speaking, any mass of data is a population.
² Chapter 13 will deal more specifically with the problem of sampling.

12.   CLASSIFICATION OF THE DATA

When a mass of data has been assembled it is necessary to classify
the material in some compact and orderly form before it can be
effectively analyzed. This procedure is known by statisticians as
tabulation. It is merely the arrangement of the data into tables, or
in a tabular form. The data in the original form are ungrouped;
when they are summarized into a table they are grouped.

The following table, Table 6, gives the scores made in college
algebra by 125 first-year students at Bucknell University. These
scores constitute a sample selected at random from a larger
population. The grades are given to the nearest integer on the
centigrade scale. This means, we recall, that a grade recorded as 92
might represent any mark between 91.5 and 92.5. We note that the
lowest recorded score is 48 and the highest score is 97, giving a
range of 97 - 48 = 49. The possible range is from 47.5 to 97.5, or 50.

Table 6. Semester Grades of 125 Students in College Algebra
at Bucknell University
(Grades recorded to the nearest integer)

    93  83  77  75  70  88  69  68  71  63
    86  58  53  50  95  79  89  87  84  78
    82  81  78  81  74  80  75  76  77  75
    73  48  76  69  55  74  62  95  90  84
    75  87  65  70  68  76  70  55  63  79
    65  80  97  91  64  68  70  79  86  83
    80  57  60  65  79  80  76  82  75  60
    75  77  62  59  92  85  73  74  77  70
    68  65  70  72  69  90  85  85  81  80
    77  67  66  67  63  77  73  74  75  73
    69  81  80  72  72  85  82  77  73  73
    74  74  75  72  70  71  75  76  76  77
    74  75  71  70  72

If these grades in college algebra are arranged in the order of
magnitude the array will be more suitable for study than in the
haphazard arrangement in Table 6, yet even the grades arranged
in this manner will still be unwieldy for a close analysis. A really
compact form may be obtained by arranging the measures into
classes of equal width, for example, 47.5-52.5, 52.5-57.5, etc., wherein
the class interval or class width is 5 points. The number of items or
measures occurring in each class (called the class frequency) is then
determined by tallying.

The traditional method of tallying is to record the frequencies by
marks until four have been made, then to make a cross mark for the
fifth score. This procedure makes up the preliminary sheet.

The procedure described above for tallying offers no facilities for
checking. If a repetition of the classification leads to a different
result, we have no means of tracing the error. If the number of
observations is large, it is better to enter the values on cards, one
card to each measure, then sort the cards into the classes we desire.
We can then check each pack, thereby placing each measure in the
proper class.

The tabular arrangement, illustrated by Table 7, consisting
of a series of classes and a corresponding set of frequencies is called a
simple frequency distribution. We designate the total frequency by N.


Table 7. Semester Grades of 125 Students in College Algebra

Preliminary Sheet

    Class         Tally                                          Frequency
    92.5-97.5     ||||                                                4
    87.5-92.5     ||||/ |                                             6
    82.5-87.5     ||||/ ||||/ ||                                     12
    77.5-82.5     ||||/ ||||/ ||||/ ||||                             19
    72.5-77.5     ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||       37
    67.5-72.5     ||||/ ||||/ ||||/ ||||/ ||||                       24
    62.5-67.5     ||||/ ||||/ |                                      11
    57.5-62.5     ||||/ |                                             6
    52.5-57.5     ||||                                                4
    47.5-52.5     ||                                                  2

    Total                                                           125 = N


The organization of the data has thus been effected and the data
are now prepared for the next step, the analysis.

Table 8. Semester Grades of 125 Students in College Algebra

(Grades recorded to the nearest integer)

                 Form (a)                          Form (b)

Class        Class Mark    Frequency        Class Mark    Frequency
                 X            f(X)              X            f(X)

92.5-97.5        95             4               95             4
87.5-92.5        90             6               90             6
82.5-87.5        85            12               85            12
77.5-82.5        80            19               80            19
72.5-77.5        75            37               75            37
67.5-72.5        70            24               70            24
62.5-67.5        65            11               65            11
57.5-62.5        60             6               60             6
52.5-57.5        55             4               55             4
47.5-52.5        50             2               50             2

Total                      125 = N          Total          125 = N

In  the  preparation  of  Table  8  we  were  cognizant  that  the  data  are 
continuous  and  are  recorded  to  the  nearest  integer.  A  score  recorded 
as  79,  for  example,  really  fell  somewhere  over  the  interval  78.5  to 
79.5.  Consequently  we  found  it  convenient  to  represent  the  end 
values  of  the  class  intervals  to  tenths.  If  the  data  had  been  recorded 
to  tenths,  we  could  have  expressed  the  two  figures  defining  each  class 
to  hundredths. 

The  two  figures  that  define  a  class  are  called  the  class  limits  of 
the  class.  In  some  tabular  representation  of  classes,  the  defining 
numbers  of  the  class  are  true  class  limits  or  class  boundaries.^  The 
class  boundaries  can  easily  be  determined  as  each  boundary  is  half 
way between the largest item in the lower class and the smallest item in
the  next  higher  class.  Thus  in  Form  (a)  above  the  largest  measure 
in  the  lowest  class  is  52  and  the  smallest  value  in  the  next  higher  class 
is  53.  The  class  boundary  is  half  way  between  52  and  53,  that  is  at 
52.5.  The  other  boundary  points  in  Form  (a)  can  be  found  in  a 
similar  manner. 

^ Some authors call class boundaries closed class limits.

The difference between the lower boundary of one class and the
lower boundary of the next higher class is the class interval or class
width. The class interval is also the difference between the upper
boundaries of two adjacent classes. The upper boundary of one class
is the lower boundary of the next higher class, and the lower boundary
of one class is the upper boundary of the next lower class. That is,
for continuous data adjacent classes should "join up" or be
contiguous. The number half way between the upper and lower
boundaries of a class is the class mark. Thus

$$\text{Class mark} = \frac{\text{Upper boundary} + \text{Lower boundary}}{2}$$

A  class  boundary  is  half  way  between  the  class  marks  of  two  adjacent 
classes.  The  class  boundaries  of  a  class  can  be  found  by  adding  to 
and  subtracting  from  the  class  mark  one  half  the  class  width.  With 
this  in  mind,  Form  (b)  is  a  mere  abridgment  of  Form  (a). 
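
The relations among class boundaries, class marks, and class width stated above may be checked with a short Python sketch. The function names are hypothetical and chosen only for this illustration; the numbers are those of Table 8.

```python
def class_mark(lower_boundary, upper_boundary):
    """The class mark is half way between the two boundaries."""
    return (lower_boundary + upper_boundary) / 2

def boundaries_from_mark(mark, width):
    """Recover the boundaries by going half a class width either side of the mark."""
    half = width / 2
    return (mark - half, mark + half)

def boundary_between(largest_in_lower_class, smallest_in_next_class):
    """A boundary lies half way between the adjoining items of two classes."""
    return (largest_in_lower_class + smallest_in_next_class) / 2

print(class_mark(92.5, 97.5))        # first class of Form (a)
print(boundaries_from_mark(95, 5))   # Form (b) expanded back to Form (a)
print(boundary_between(52, 53))      # boundary between the two lowest classes
```

The second function is the sense in which Form (b) is an abridgment of Form (a): given the class mark and the class width, nothing is lost.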

Form (a), using class boundaries, is a widely used method of
indicating the classes of a simple frequency distribution. It is
suitable to discrete as well as to continuous data, and we recommend
it as our favorite method. However, other methods for defining the
classes are found in the literature of the subject. We shall present
and discuss some well known forms to which the data of Table 8 may
be applied.

Form (c)

Class          Class          Class Mark    Class Mark
               Boundaries     Continuous    Discrete

93 a.u. 98     92.5-97.5          95            95
88 a.u. 93     87.5-92.5          90            90
etc.           etc.               etc.          etc.

In Form (c), "93 a.u. 98" means "93 and under 98." That is,
in this class are found the measures as large as 93 but less than 98.
The classes in Form (c) are defined by class limits but not by class
boundaries. For clearness, we give the class boundaries which in
turn assist us in finding the class marks. Form (c) is suitable for
continuous and discrete data, but in using this form the student must
recall that a score of 93 means any number in the interval 92.5 to
93.5 and thus the lower boundary is 92.5. Similarly, a score of 88 has
a lower boundary at 87.5. The class marks are now easily determined.

Occasionally the classes are denoted by the smallest and largest
measures of a given class, and the class interval may appear to range
from the smallest to the largest measurement for each class. For
continuous variates, this method of defining the class does not show
the full range of the class and leaves gaps at the ends of the class.
In this, as in all forms of class representation, the statistician must
ascribe to each class the true class limits or the class boundaries, and
the true class mark. Thus in Form (d) in the given classes we indicate
the class limits by the smallest and largest values that may fall in a
given class. We have included, for emphasis and for clearness, the
class boundaries.

Form (d)

Class          Class          Class Mark    Class Mark
               Boundaries     Continuous    Discrete

93-97          92.5-97.5          95            95
88-92          87.5-92.5          90            90
etc.           etc.               etc.          etc.

Occasionally we find in the literature a tabular representation
similar to Form (e). This form states ambiguously what Form (c)
states more definitely. It is unsafe for tallying scores for the reason
that it is easy to mis-tally boundary scores. Thus, to which class
would a score of 93 belong? Again, we have included the class
boundaries for sake of clearness, also the class marks.

Form (e)

Class          Class          Class Mark    Class Mark
               Boundaries     Continuous    Discrete

93-98          92.5-97.5          95            95
88-93          87.5-92.5          90            90
etc.           etc.               etc.          etc.

In  later  chapters  we  shall  find  it  necessary  to  locate  certain  division 
points  on  the  X-scale:  quartiles,  deciles,  percentiles.  To  find  these 
points  we  shall  need  true  class  limits  or  class  boundaries. 

The determination of true class marks is also very important as
many of our statistical constants, such as the arithmetic mean and
the standard deviation, are found from the class marks of the classes.
In fact, to save labor in computation, we shall find it necessary to
assume that the items are uniformly distributed over the given
intervals and that the class frequencies are concentrated at the
class marks.

From this discussion of the several forms it is evident that,
inasmuch as the class boundaries must eventually be found to aid in the
analysis of the data, we can save ourselves confusion and time by
adopting class boundaries in the beginning of our problem. This
procedure we have followed and it is one we highly recommend.

It  should  be  emphasized  that  when  a  score  is  tabulated  in  the 
proper  class  interval,  it  loses  its  identity.  Of  course  it  falls  somewhere 
within  the  boundaries  of  the  interval,  but  in  computation  we  do  not 
use  it  again.  For  computational  purposes  in  effecting  the  numerical 
analysis,  it  is  necessary  that  we  concentrate  the  class  frequency  at 
the  mid-point  of  the  class  interval.  Thus,  in  our  computations  on 
Table  8,  we  replace  the  scores  93,  95,  97,  95  of  the  class  92.5  —  97.5 
by  four  scores  each  of  value  95,  the  mid-value  of  the  class.  Similarly, 
we  replace  the  scores  88,  90,  89,  90,  91,  92  of  the  class  87.5  -  92.5 
by  six  scores  each  of  value  90,  the  mid-value  of  the  class.  And  so 
on  for  the  other  classes. 
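
A brief Python sketch of this replacement follows, using the scores quoted above and the frequencies of Table 8. The variable names are our own, and the mean computed from class marks anticipates later chapters, so it should be read as an illustration and not as the procedure of the text. For these two classes the sums happen to agree exactly; in general the agreement is only approximate.

```python
# Scores quoted in the text for the two highest classes of Table 8.
scores_by_mark = {
    95: [93, 95, 97, 95],          # class 92.5-97.5
    90: [88, 90, 89, 90, 91, 92],  # class 87.5-92.5
}

for mark, scores in scores_by_mark.items():
    replaced = [mark] * len(scores)  # every score moved to the class mark
    print(mark, sum(scores), sum(replaced))

# Form (b) of Table 8: class mark -> frequency.
table_8 = {95: 4, 90: 6, 85: 12, 80: 19, 75: 37,
           70: 24, 65: 11, 60: 6, 55: 4, 50: 2}

n = sum(table_8.values())                                  # total frequency N
grouped_mean = sum(x * f for x, f in table_8.items()) / n  # frequencies at the marks
print(n, grouped_mean)
```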

While our assumption that the scores are evenly distributed over
the interval is seldom verified by observed data, yet if the sample is
sufficiently numerous the assumption leads only to a very slight error.
Some such assumption must be made, and experience and statistical
theory recommend the assumptions of evenness of measures over
the interval and the concentration of the class frequency at the
mid-point of the class interval.

Example  1.  If  10  scores  in  integral  variates  are  evenly  distributed  over 
the  interval  72.5  —  77.5,  what  are  the  scores? 

Since the scores are integers, 10 in number, and must be evenly
distributed over the interval, they would have the values 73, 73, 74, 74, 75,
75, 76, 76, 77, 77.

Example  2.  Are  the  values  73,  73,  73,  73,  73,  77,  77,  77,  77,  77  evenly 
distributed  over  the  interval  72.5  —  77.5? 

No. While the statistical results are essentially the same as if the entire
10 scores are situated at the class mark, 75, these values are not evenly
distributed over the given interval.

Example 3. If 20 measurements, rounded to the nearest half-inch, are
evenly distributed over the interval 72.25-82.25, what are their values?

Their values are: 72.5, 73.0, 73.5, 74.0, . . ., 82.0.

Would two of each of the following 10 measurements be satisfactory:
73, 74, 75, . . ., 82?

Example  4.  What  measurements  would  satisfy  for  the  preceding  example 
if  the  interval  were  72  a.u.  82?  What  is  the  upper  boundary  of  the  class? 

The  values  would  be:  72.0,  72.5,  73.0,  .  .  .,  81.0,  81.5.  The  largest 
value  in  the  class  is  81.5  and  the  smallest  value  in  the  next  higher  class  is 
82.0.  The  class  boundary  is  the  value  half  way  between  them,  namely 
81.75. 
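
The reasoning of Examples 1 to 4 may be traced in a small Python sketch. We assume here that "evenly distributed" means that every value recordable in the class occurs equally often; the function names are hypothetical, and exact fractions are used so that half-units such as 72.25 are not disturbed by rounding.

```python
from collections import Counter
from fractions import Fraction

def recordable_values(lower, upper, unit):
    """All multiples of the recording unit lying between the two boundaries."""
    lower, upper, unit = Fraction(lower), Fraction(upper), Fraction(unit)
    k = lower // unit + 1              # first multiple of unit above the lower boundary
    values = []
    while k * unit < upper:
        values.append(k * unit)
        k += 1
    return values

def is_evenly_distributed(scores, lower, upper, unit):
    """True when every recordable value in the class occurs equally often."""
    counts = Counter(Fraction(s) for s in scores)
    expected = recordable_values(lower, upper, unit)
    if set(counts) - set(expected):
        return False                   # some score lies outside the class
    return len({counts[v] for v in expected}) == 1

example_1 = [73, 73, 74, 74, 75, 75, 76, 76, 77, 77]
example_2 = [73] * 5 + [77] * 5
print(is_evenly_distributed(example_1, "72.5", "77.5", 1))
print(is_evenly_distributed(example_2, "72.5", "77.5", 1))

example_3 = recordable_values("72.25", "82.25", "0.5")
print(len(example_3), float(example_3[0]), float(example_3[-1]))

# Example 4: the boundary half way between 81.5 and 82.0.
example_4_boundary = (Fraction("81.5") + Fraction("82.0")) / 2
print(float(example_4_boundary))
```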

EXERCISES 

1.  Suppose  the  data  are  dinner  checks  from  a  cafeteria.  Show  that  two 
checks  for  each  of  the  values  93,  94,  95,  96,  97  cents  would  give  the  same 
total  as  10  checks  of  95  cents  each. 

2.  Suppose  the  temperature  at  Lewisburg  is  recorded  to  the  nearest  tenth 
of  a  degree  and  that  a  5  degree  class  interval  has  been  selected.  If  the  class 
limits  are  60.0  —  64.9,  55.0  —  59.9,  50.0  —  54.9,  etc.,  what  are  the  class 
boundaries  and  the  class  marks  of  the  three  classes? 

3. A group of intelligence quotients (continuous data) are arranged with
the class intervals as follows: 75-79, 80-84, etc. What are the class
boundaries and the class marks?

4.  What  values  are  contained  in  the  interval  75  —  79  if  the  data  are 
discrete?  If  the  data  are  continuous  and  recorded  to  the  nearest  integer? 

13.   THE  CHOICE  OF  THE  CLASS  INTERVAL 

In  the  choice  of  a  class  interval,  the  following  brief  suggestions  may 
be  helpful: 

1.  The  number  of  classes  should,  in  general,  not  be  less  than  10  nor 
more  than  30,  seldom  more  than  25. 

2.  If  possible,  the  class  intervals  should  be  uniform  in  width. 

3. In general there should be no class intervals without definite limits.
Intervals of the type "all over" and "all under" are to be avoided
when possible.^

4.  To  facilitate  computation,  class  intervals  of  multiples  of  5  or  10 
are  convenient. 
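
The first suggestion can be turned into a quick check, sketched here in Python. The function names and the list of candidate widths are our own assumptions; the sketch covers only suggestion 1 and leaves the other suggestions to the judgment of the reader.

```python
import math

def number_of_classes(lowest, highest, width):
    """Classes of equal width needed to cover the range from lowest to highest."""
    return math.ceil((highest - lowest) / width)

def acceptable_widths(lowest, highest, candidate_widths, fewest=10, most=30):
    """Candidate widths that give between `fewest` and `most` classes."""
    return [w for w in candidate_widths
            if fewest <= number_of_classes(lowest, highest, w) <= most]

# Range of the college algebra grades as grouped in Table 8: 47.5 to 97.5.
print(number_of_classes(47.5, 97.5, 5))
print(acceptable_widths(47.5, 97.5, [1, 2, 5, 10]))
```

For this range a width of 1 gives too many classes and a width of 10 too few, which leaves 2 and 5; suggestion 4 then favors 5.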

^ Many of the tables found in the data sent out by the United States
Government are of this type.

14. CLASS LIMITS

The lowest limit of the lowest class may be chosen in many
positions. This choice and that of the class interval will practically
determine the limits of the other classes. We rather hesitate to
state many rules for their selection; much must be left to the
judgment and resourcefulness of the student. The following suggestions
should prove helpful:

1. To facilitate computation, the mid-points should be integers. We
shall find that carrying out this suggestion is frequently impossible.

2. Certain types of data are loaded at special points. For example,
college marks on a centigrade scale are loaded at 60, 65, 70, 75, etc.
Distributions in which age is the independent variable are usually
loaded at 20, 25, 30, etc. When the data display such a peculiarity,
these loaded points should be chosen as mid-points of the class
intervals. This is especially to be kept in mind, since in the analysis
of our distributions we shall assume that all measures of a class are
concentrated at the mid-point of the class.

3.  Class  limits  should  be  unambiguous  and  mutually  exclusive. 

The class limits can be decided accurately only when the accuracy
of the data is known. A score for either type of variate is assumed to
extend from half a unit in the last place of measurement below to half
a unit above the entry recorded. If the data are accurate to tenths,
the class limits should be expressed to hundredths; if heights are
measured to the nearest quarter of an inch, the class limits should
be arranged in eighths of an inch.

In many of the exercises that follow in the text, it will not be
possible to carry out the suggestion of the preceding paragraph,
because the original observers were not meticulously careful to state
the accuracy of the original measurements. When this is the case
we shall have to make some reasonable assumptions and proceed
along the line suggested by them.


EXERCISES 

1. The following diagram for the first class of Form (a), Table 8, shows
that the mid-point of the class (the class mark) is 95. Make similar
diagrams to explain the other Forms.


Diagram 1

92.5    93    94    95    96    97    97.5

2. Suppose you were asked to construct a frequency table of the grades
in Table 6 (p. 24) with the class marks at 97.5, 92.5, etc. and with a class
interval of 5, what would you say?

3. An educational research department recently sent the author a
score card for some data (to the nearest integer). The Class column was
marked thus: 0-4, 5-9, 10-14, etc. What are the class marks? the true
class limits?

4.  The  daily  wages  of  100  men  were  recorded  to  the  nearest  cent. 
Complete  the  table  finding  the  class  boundaries  and  the  class  marks. 


Class          Class Boundaries    Class Mark    f(X)

$2.25-2.49                                          5
 2.50-2.74                                         11
 2.75-2.99                                         23
 3.00-3.24                                         29
 3.25-3.49                                         17
 3.50-3.74                                          9
 3.75-3.99                                          6

Total                                             100

5.  The  weights  of  1,000  male  students  (in  pounds)  were  recorded  to  the 
nearest  half  pound.   Complete  the  table. 


Class Boundaries    Class Mark    f(X)

                      105.25         4
                      115.25        12
                      125.25        20
                      etc.          etc.

6.  The  heights  of  1,000  male  students  (in  inches)  were  recorded  to  the 
nearest  tenth  of  inch.  Complete  the  table. 


Class             Class Boundaries    Class Mark    f(X)

60.8 a.u. 62.8                          61.75         1
62.8 a.u. 64.8                          63.75         3
64.8 a.u. 66.8                          65.75        11
etc.                                    etc.         etc.

7. The ages at marriage of 100 women were distributed as shown in
the table. Find the class boundaries.

15-19        17         4
20-24        22        28
25-29        27        23
etc.         etc.      etc.

8.  The  number  of  pedicels  per  cluster  of  a  certain  plant  resulted  in  the 
following  distribution.   Find  the  class  boundaries. 


Class        Class Boundaries    Class Mark    f(X)

12-19                               15.5          8
20-27                               23.5         52
28-35                               31.5        176
etc.                                etc.        etc.

9.  A  distribution  of  heights  of  students  (in  centimeters)  was  arranged 
as  follows: 


Height           Class Mark
(centimeters)

155-157             156           4
158-160             159           8
161-163             162          26
etc.                etc.         etc.

What are the class boundaries?

Can you guess at the accuracy of the original measurements?
Do you agree with the class marks?

10. Suppose you are given 500 grades (in per cent) in English to
distribute. The lowest grade is 20% and the highest grade is 90%. You
decide upon a class width of 5%. Which of the two groupings, A or B, would
be preferable?

             A                                 B

Class        X       f(X)          Class        X       f(X)

19.5-24.5    22                    17.5-22.5    20
24.5-29.5    27                    22.5-27.5    25
etc.         etc.                  etc.         etc.

11. In his book, "The Fundamentals of Statistics," Professor L. L.
Thurstone tabulates the scores made on an intelligence test by 140
freshmen at Swarthmore College. His table follows:

Scores       Class Mark    f(X)

40-49            45           1
50-59            55           5
60-69            65          12
70-79            75          21
80-89            85          23
90-99            95          23
100-109         105          25
110-119         115          14
120-129         125          11
130-139         135           4
140-149         145           1

Total                       140

What are the class boundaries?

Using our assumptions, what are the class marks of the classes?
What are the tacit assumptions that Professor Thurstone makes
regarding the extreme scores in the classes?

12. The following data pertain to the ages of unemployed male workers
in Boston in 1930. Professor R. C. White in his "Social Statistics,"
page 215, takes the classes and the class marks as shown.

Age (Years)    Class Mark    f(X)

10-14             12.5
15-19             17.5
20-24             22.5
etc.              etc.

What are the class boundaries?
Do you agree with the class marks?

What assumptions does Professor White evidently make regarding the
largest and smallest ages of a class?

13. In his "Statistical Methods, Revised," Professor F. C. Mills
exhibits on page 105 a distribution of the weekly earnings of workers in
open-hearth furnaces in the Pittsburgh district in 1935. A portion of the table
is shown here.

Class Interval            Mid-point    Frequency
(in dollars per week)

$0- 3.99                      2            67
 4- 7.99                      6           290
 8-11.99                     10           437
etc.                         etc.         etc.

What are Professor Mills' assumptions regarding the values that are
placed in the given classes?

According to our assumptions what would be the mid-points of the
classes?

14. In Davies and Yoder, "Business Statistics," pages 110 and 114 we
find the following distribution:

              X        f

10-12        11        3
12-14        13       15
14-16        15       20
16-18        17       10
18-20        19        2

What are the assumptions of the authors regarding the values that are
placed in the several classes?

According to our assumptions what would be the values of X?

15. In his book, "Statistics for Students of Psychology and
Education," Professor Herbert Sorenson on page 43 exhibits a distribution of
scores obtained on an objective test in educational psychology. Here is a
portion of the table.

Scores by Intervals      X        f

80-84                  82.5       3
75-79                  77.5       5
70-74                  72.5       7
etc.                   etc.      etc.

What are Professor Sorenson's assumptions regarding the scores in the
given intervals? (See page 44 of his text.)

What are the class boundaries and the values of X according to our
assumptions?

16. In his book, "The Mathematical Part of Elementary Statistics,"
Professor B. H. Camp gives on page 8 a distribution of wage data, a
portion of which we show here. What are Professor Camp's assumptions
regarding the scores in the given intervals? What are the class boundaries
of the classes?

Class         Mid-value      f

$4.50-5.99      5.245       43
 6.00-7.49      6.745       99
etc.            etc.        etc.

The illustrations found in these Exercises certainly show that
authorities differ in their interpretations of class limits, class marks,
class boundaries, et cetera. In reading the literature of our field
we must be alert, therefore, to the assumptions, either tacit or
expressed, that guide the procedure. Further, we must be charitable
and seek to understand what are the assumptions that are guiding
an author's steps, and realize that there is more than one way of
doing a simple statistical task.

The problem we are discussing is simply this: what is meant by
a recorded score of 74? of 74.6? of 74.67? We assume that if a score
is recorded 74, its value is between 73.5 and 74.5; if it is recorded
74.6, its value is between 74.55 and 74.65; and so on. On the
contrary, many statisticians assume that if a score is recorded 74, its
value ranges from 74 to but not including 75; if a score is recorded
74.6, it ranges from 74.6 to but not including 74.7; and so on. They
also use another method of description. They assume that a recorded
score of 74 ranges from 74 to 74.99; a recorded score of 74.6 ranges
from 74.6 to 74.699; and so on.
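
The two readings of a recorded score may be set side by side in a short Python sketch. The function names are hypothetical; decimal arithmetic is used so that a value such as 74.55 is represented exactly.

```python
from decimal import Decimal

def nearest_unit_interval(recorded, unit):
    """Our assumption: the true value lies within half a unit of the record."""
    r, half = Decimal(recorded), Decimal(unit) / 2
    return (r - half, r + half)

def truncated_interval(recorded, unit):
    """The alternative: from the record up to, but not including, the next unit."""
    r = Decimal(recorded)
    return (r, r + Decimal(unit))

for recorded, unit in [("74", "1"), ("74.6", "0.1")]:
    print(recorded, nearest_unit_interval(recorded, unit),
          truncated_interval(recorded, unit))
```

The two conventions place the center of the interval half a unit apart, which is why class marks computed under them differ by half a unit in the last recorded place.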

The mathematician, accustomed to rigor in his thinking, generally
prefers our method of description, namely, that the classes be
determined rigidly by class boundaries, whereas the worker in an applied
field may be willing to sacrifice some rigor. This is one of the
controversial questions in statistical procedure so let us not assume that
we have the full truth. After all, it is not a matter of extreme
importance whether the scores on an English test average 74.26% or
74.16%. It is essential, however, that we impose refinements when
the data warrant them. It is just as essential that we do not give
a false impression of accuracy in our procedures.




15. GRAPHICAL REPRESENTATION

When  the  data  have  been  organized  into  a  suitable  table,  they  are 
now  ready  for  the  first  step  in  the  analysis,  that  of  presenting  the 
data  graphically.  Graphical  presentations  display  outstanding  facts 
and  bring  into  bold  relief  relationships  that  otherwise  would  be 
difficult  to  comprehend  or  possibly  would  not  be  noted  at  all.  A 
column  of  figures  may  overwhelm  us;  the  same  data  in  graphic  form 
may  tell  an  easily  understood  story.  Relative  quantities  especially 
can  be  grasped  through  visual  means  with  a  comprehensiveness  that 
is  not  possible  by  pure  analysis. 

While  the  ultimate  basis  of  graphical  presentation  is  mathematical, 
yet  the  practical  work  of  constructing  the  charts  can  be  accomplished 
without  a  profound  knowledge  of  the  true  mathematical  basis. 
Charts  and  graphs,  then,  can  enable  us  to  discover  simply  and 
quickly  many  facts  and  mathematical  relationships  about  numerical 
data  without  the  use  of  more  difficult  methods  of  analysis.  The 
careful  statistician,  however,  will  be  very  cautious  to  verify  by  the 
more  precise  methods  of  analysis  the  suggestions  that  he  receives 
from  the  graph. 

It  is  not  our  intention  to  present  in  this  book  a  detailed  account 
of  the  many  graphical  procedures  that  are  used  today.  We  shall 
explain certain important principles of graphic presentation and leave
it  to  the  reader  who  desires  a  more  comprehensive  knowledge  to 
consult  the  excellent  volumes  that  are  accessible.^ 

^ Excellent references for graphical presentation are listed in the bibliography
in Appendix A of this volume.

16. GRAPHICAL REPRESENTATION OF FREQUENCY DISTRIBUTIONS

Probably the best graphical representation of a simple frequency
distribution is furnished by a column diagram or histogram. It is
constructed by erecting upon the class intervals rectangles whose
altitudes are proportional to the frequencies. Suitable scales must
be chosen so that the graph of the data can be made to fit the data,
be of sufficient size to be readily interpreted, and be of such
proportions that it will be agreeable to our artistic tastes. The left-hand
side of the first rectangle is plotted at the lower boundary of the
lowest class and the right-hand side of the last rectangle is plotted at
the upper boundary of the highest class. Chart 1 shows the histogram
for the distribution of grades in college algebra previously tabulated
in Table 8 (p. 26).

Chart 1

The student will note that each rectangle contains an area that is
proportional to and represents the frequency of the class and that
the total area equals the total frequency times the class width. If the
class width is taken as the unit, the total area equals the total
frequency.
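
The statement about areas can be verified directly for the data of Table 8. The following Python sketch is an illustration under our own choice of variable names; each rectangle is given the class width as its base and the class frequency as its altitude.

```python
# Frequencies of Table 8, keyed by (lower boundary, upper boundary).
table_8 = {
    (47.5, 52.5): 2, (52.5, 57.5): 4, (57.5, 62.5): 6, (62.5, 67.5): 11,
    (67.5, 72.5): 24, (72.5, 77.5): 37, (77.5, 82.5): 19, (82.5, 87.5): 12,
    (87.5, 92.5): 6, (92.5, 97.5): 4,
}

class_width = 5
total_frequency = sum(table_8.values())

# Each rectangle has the class width as base and the frequency as altitude.
histogram_area = sum((upper - lower) * f for (lower, upper), f in table_8.items())

print(total_frequency, histogram_area, total_frequency * class_width)

# With the class width taken as the unit of length, the area equals N.
area_in_class_width_units = histogram_area / class_width
print(area_in_class_width_units)
```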

Another method of representing graphically a frequency
distribution is by what is called a frequency polygon. Its construction is
very much like the plotting of curves and line diagrams in elementary
algebra. In Form (b) of Table 8 (p. 26), each pair of values X,
f(X), defines a point. Plotting the several points and connecting them
by a broken line, we obtain the frequency polygon. The last points at
either end must be joined to the base at the center of the next class
interval. The observing student will note that the vertices of the
frequency polygon are merely the mid-points of the tops of the
rectangles of the histogram, and that the ordinates represent the
frequencies. Chart 2 shows the frequency polygon for the grades in college
algebra displayed in tabular form in Table 8 (p. 26).
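
The construction of the polygon may be sketched in Python for the data of Form (b), Table 8. The names used are our own. The sketch closes the polygon at the centers of the empty classes adjoining each end, as directed above, and computes the enclosed area by trapezoids; with classes of equal width this area agrees with that of the histogram.

```python
# Form (b) of Table 8 in ascending order: class marks X and frequencies f(X).
marks = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
freqs = [2, 4, 6, 11, 24, 37, 19, 12, 6, 4]
width = marks[1] - marks[0]

# Close the polygon on the base line at the centers of the adjoining empty classes.
vertices = [(marks[0] - width, 0)] + list(zip(marks, freqs)) + [(marks[-1] + width, 0)]

def area_under(points):
    """Area between a broken line and the base line, by trapezoids."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

print(vertices[0], vertices[-1])
print(area_under(vertices), width * sum(freqs))
```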

Chart 2. Grades of 125 Students in College Algebra
(horizontal axis: Grades, 45 to 100)

The fact that the polygon extends beyond the limits of the table
suggests that if the grades of a larger group of students were taken,
a few would have been found with grades less than any in our sample
and a few with grades larger than any in our sample. Both the
histogram and the polygon show graphically the outstanding facts
of the sample considered. If one is interested in the sample only,
this representation is sufficient. However, the purpose of the
investigation would usually be to answer certain questions regarding
the larger group, the parent population, of all the grades at this
institution in college algebra. The frequency polygon for the parent
population would resemble very closely a smooth curve.

If the class interval be made smaller and smaller and the total
frequency, N, be increased without limit, the limit approached by
the histogram and the frequency polygon is termed a frequency curve.
In Chart 1, a frequency curve has been drawn.

It  should  be  borne  in  mind  that  this  frequency  curve  brings  out  the 
general  tendencies  of  the  parent  population  by  means  of  what  we  have 
assumed  to  be  a  representative  sample.  This  curve  meets  the  base 
line  near  the  same  points  at  which  the  frequency  polygon  meets  it; 
it  rises,  slowly  at  first,  to  a  maximum,  then  recedes  again  to  the  base 
line.  The  curve  should  be  so  drawn  that  the  total  area  under  the 
curve  is  equal  to  the  total  area  of  the  histogram. 

The graphical representation of the data of Table 9 brings out in
bold relief the outstanding facts that would possibly not be noted by
a glance at the table. Since our primary aim here is the comparison
of the two sets of mortality rates, we shall superimpose them on the
same graph sheet, Chart 3, so that they may be readily compared.^

Chart 3

^ The reader will note that the age divisions of Chart 3 are unequal.

Table 9. Mortality Rates per 100,000 Population for Typhoid
Fever in the Registration States, 1910 and 1920 ^

Age            Mortality Rate    Mortality Rate
               in 1910           in 1920

Under 1             5.5               1.1
1 to 4             11.9               2.4
5 to 9             13.0               3.5
10 to 14           16.6               5.6
15 to 19           31.2               8.5
20 to 24           37.1               8.0
25 to 34           30.4               6.2
35 to 44           22.1               5.5
45 to 54           20.4               4.7
55 to 64           18.5               4.7
65 to 74           16.9               4.7
75 and on          14.0               2.0

EXERCISES 

1.  The  lengths  of  a  sample  of  75  beans  were  measured  to  the  nearest 
tenth  of  a  centimeter.  The  results  are  shown  in  the  following  distribu- 
tion: 

Distribution  of  Length  of  75  Beans 


Length 

Class  Mark 
X 

Frequency 
fiX) 

1.45-1.55 

1.5 

2 

1.55-1.65 

1.6 

4 

1.65-1.75 

1.7 

6 

1.75-1.85 

1.8 

8 

1.85-1.95 

1.9 

12 

1.95-2.05 

2.0 

20 

2.05-2.15 

2.1 

11 

2.15-2.25 

2.2 

9 

2.25-2.35 

2.3 

2 

2.35-2.45 

2.4 

1 

Total 

75 

Draw  the  histogram  for  these  data.  Connect  the  mid-points  of  the  tops 
of  the  rectangles,  complete  at  the  extremes  as  previously  directed,  and 
thereby  obtain  the  frequency  polygon.  What  is  the  total  area  of  the 
histogram?  of  the  polygon? 

^  Mortality  Statistics,  1910-1920,  United  States  Bureau  of  the  Census,  p.  36. 
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2.  If  1,024  throws  are  made  with  10  coins,  theoretically,  the  following 
results  are  "  expected 


Theoretical  Frequencies  in  Coin-Tossing 


Number  of  Heads 
Turning  up 
X 

Frequency 
fiX) 

Number  of  Heads 
Turning  up 

X 

Frequency 
f(X) 

0 
1 

2 
3 
4 

1 

10 
45 
120 
210 

6 
7 
8 
9 
10 

210 
120 
45 
10 
1 

5 

252 

Tout 

1,024 

Draw  the  frequency  polygon. 

This  distribution,  we  observe,  is  symmetrical  with  respect  to  a  vertical 
line  drawn  through  the  point  (5,  0).  While  symmetrical  distributions 
never  occur  in  observed  data,  they  are  closely  approximated  in  biological 
and  anthropometric  measurements.  Many  educational  measurements 
also  result  in  series  that  possess  remarkable  degrees  of  symmetry. 
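
The theoretical frequencies of Exercise 2 can be reproduced with a short computation. This sketch assumes fair coins thrown independently, so that the expected number of throws showing X heads is 1,024 times C(10, X) divided by 2^10; the variable names are illustrative.

```python
from math import comb

n_coins = 10
throws = 2 ** n_coins  # 1,024 throws

# Expected number of throws showing x heads: throws * C(10, x) / 2**10.
expected = {x: throws * comb(n_coins, x) // 2 ** n_coins
            for x in range(n_coins + 1)}

for x, f in expected.items():
    print(x, f)

# Symmetry about X = 5: f(X) equals f(10 - X) for every X.
is_symmetric = all(expected[x] == expected[n_coins - x] for x in expected)
```

The check `is_symmetric` expresses the symmetry about the vertical line through (5, 0) noted in the text.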

3.  As  another  example  of  a  series  of  discrete  variates,  consider  the  dis- 
tribution of  the  following  table: 


Distribution  of  Rays  in  Tail  Fins  of  703  Flounders^ 


Number of Rays X    Number of Flounders f(X)

      47                      5
      48                      2
      49                     13
      50                     23
      51                     58
      52                     96
      53                    134
      54                    127
      55                    111
      56                     74
      57                     37
      58                     16
      59                      4
      60                      2
      61                      1

Total                       703

Draw  the  histogram  and  a  frequency  curve  for  these  data.^ 

^ Paul Riebesell, Biometrik und Variationsstatistik, p. 760.

2  As  with  all  discrete  variates,  this  curve  is  defined  only  at  the  points  deter- 
mined by  the  data.  We  draw  the  curve  merely  to  emphasize  the  characteristics 
of  the  distribution. 
4. The data in the following table give the frequencies of the numbers
of  petals  on  a  certain  series  of  the  plant  named.  They  illustrate  what  is 
called  the  J-shaped  distribution. 


Frequencies  of  Petal  Numbers, 
Ranunculus  Bulbosus 


Number of Petals X    Frequency f(X)

        5                  133
        6                   55
        7                   23
        8                    7
        9                    2
       10                    2

Total                      222

Plot  the  histogram  and  the  frequency  curve. 


17. GRAPHICAL REPRESENTATION OF TEMPORAL DISTRIBUTIONS

The  distributions  we  have  thus  far  considered  have  dealt  mainly 
with  biological  and  educational  data.  They  have  not  generally 
been  primarily  related  to  time.  The  tabular  representations  have, 
in  general,  shown  few  members  at  the  extremes  but  they  have  shown 
a  comparatively  large  number  in  the  central  portions  of  the  tables. 
The  graphical  representations,  whether  by  histogram,  polygon,  or 
curve,  have  possessed  a  common  description,  namely,  low  at  each 
end with a maximum near the center. We shall call such distributions
mound-shaped.

Of the distributions previously considered, some have shown a
wide variation or dispersion, whereas others have shown moderate
variation. Further, they have been more or less unsymmetrically
distributed about any line or point. In other words, they have
possessed a quality of asymmetry or skewness. The distribution of
algebra grades seemed considerably peaked (leptokurtic) near the
center. This quality of "peakedness" (or "flatness") is called kurtosis
or excess.^


1  Kurtosis  by  the  British  school;  excess  by  the  Scandinavian  school. 
In the chapters that follow we shall develop measures of these
qualities  of  the  distributions.  Our  present  task  is  the  organization 
and  the  graphical  representation  of  the  data;  our  next  problem  will 
be  its  algebraical  and  arithmetical  analysis. 

Another type of distribution frequently encountered in dealing
with economic and mortality data is that in which time is the
independent variable. Such distributions are called temporal
distributions or time series.

We shall note that time series display a number of distinct types
of movement such as long-time trends, seasonal variation, cyclical
movements, etcetera. These types of movement call for close
examination.

As  a  first  example,  consider  the  growth  of  population  of  the  United 
States  from  1790  to  1930  inclusive. 


Table  10.    Population:   Continental  United  States,  1790-1930  ^ 


Census    Population     Per Cent of Increase
Year      (thousands)    over Preceding Census

1790         3,929
1800         5,308            35.1
1810         7,240            36.4
1820         9,638            33.1
1830        12,866            33.5
1840        17,069            32.7
1850        23,192            35.9
1860        31,443            35.6
1870        38,558            26.6*
1880        50,156            26.0*
1890        62,948            25.5
1900        75,995            20.7
1910        91,972            21.0
1920       105,711            14.9
1930       122,725            16.1

*  Estimated  rates  are  given  here. 

The  graphical  representation  of  these  data  is  shown  in  Chart  4. 
We  note  that  the  population  has  enjoyed  a  steady  growth.  From 
1790  to  1860  each  census  increased  approximately  one-third,  usually 
somewhat  more,  over  the  preceding;  from  1860  to  1890  the  decade 
rates  of  growth  were  somewhat  over  one-fourth,  and  from  1890  to 
1910 a little over one-fifth. Since 1910 the decade rates of increase
have  been  about  15  per  cent.  Hence  we  see  that,  whereas  the  popula- 
tion has  steadily  increased,  the  rate  of  increase  has  been  steadily 
decreasing. 
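
The per cent of increase over the preceding census can be recomputed from the population column of Table 10. The following is a minimal sketch; the dictionary `population` and the name `pct_increase` are illustrative and not part of the original table.

```python
# Census populations in thousands, from Table 10.
population = {
    1790: 3929, 1800: 5308, 1810: 7240, 1820: 9638, 1830: 12866,
    1840: 17069, 1850: 23192, 1860: 31443, 1870: 38558, 1880: 50156,
    1890: 62948, 1900: 75995, 1910: 91972, 1920: 105711, 1930: 122725,
}

years = sorted(population)

# Per cent of increase of each census over the preceding one.
pct_increase = {
    later: 100 * (population[later] - population[earlier]) / population[earlier]
    for earlier, later in zip(years, years[1:])
}

for year, pct in pct_increase.items():
    print(year, round(pct, 1))
```

For 1870 and 1880 the rates computed directly from the population column (about 22.6 and 30.1) differ from the starred entries, which the table marks as estimated rates; the remaining years reproduce the tabled values.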

^  The  data  are  taken  from  the  Fifteenth  Census  of  the  United  States,  Bureau 
of  the  Census,  Vol.  I,  Population,  p.  6. 
As a second example of time series, consider the following table
which gives the production of lumber in the United States in billions
of board feet for the given years.

Table 11. Lumber Production in the United States *

Year    Reported Production
        (billions of board feet)

1909          44.5
1910          40.0
1911          37.0
1912          39.2
1913          38.4
1914          37.3
1915          37.0
1916          39.9
1917          35.8
1918          31.9
1919          34.6
1920          33.8
1921          27.0
1922          31.6

* The data are taken from Statistical Abstract of the United States, 1928, p. 689.
These data are represented graphically in Chart 5. This is a typical
diagram  for  a  historical  series.  The  broken  line  which  represents  the 
production  oscillates  back  and  forth  on  either  side  of  the  line  of  trend 
which  we  have  estimated  graphically.  All  the  points  for  the  produc- 
tion polygon  lie  within  a  comparatively  narrow  strip  of  which  the 
trend  line  is  the  center.  Both  the  trend  line  and  the  production 
polygon  emphasize  the  general  diminishing  of  the  production  during 
the  years  in  question.  In  a  later  chapter  we  shall  discuss  methods 
for  a  closer  analysis  of  these  data. 
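
The trend line of Chart 5 was estimated graphically. As a rough numerical counterpart, the sketch below fits a straight line by least squares, a method swapped in here as an assumption and not the procedure used for the chart. All variable names are illustrative.

```python
# Lumber production in billions of board feet, from Table 11.
production = {
    1909: 44.5, 1910: 40.0, 1911: 37.0, 1912: 39.2, 1913: 38.4,
    1914: 37.3, 1915: 37.0, 1916: 39.9, 1917: 35.8, 1918: 31.9,
    1919: 34.6, 1920: 33.8, 1921: 27.0, 1922: 31.6,
}

years = list(production)
values = list(production.values())
n = len(years)

mean_year = sum(years) / n
mean_value = sum(values) / n

# Least-squares slope and intercept of the line value = intercept + slope * year.
slope = (sum((t - mean_year) * (y - mean_value) for t, y in zip(years, values))
         / sum((t - mean_year) ** 2 for t in years))
intercept = mean_value - slope * mean_year

# Deviations of the production polygon from the fitted trend line.
residuals = [y - (intercept + slope * t) for t, y in zip(years, values)]

print("change per year:", round(slope, 2))
```

A negative slope corresponds to the general diminishing of production during the years in question, and the residuals measure how far the production polygon oscillates on either side of the fitted line.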

Chart  5  affords  a  good  illustration  of  the  possibilities  of  omitting 
unimportant  areas.  In  order  to  give  emphasis  to  the  main  facts  of 
the  data,  we  obey  the  instructions  of  the  Joint  Committee  on  Stand- 
ards for  Graphic  Presentation  ^  to  the  effect  that  the  zero  line  should 
be  shown  by  the  use  of  a  horizontal  break  in  the  diagram. 

Chart  5 
^ A. C. Haskell, How to Make and Use Graphic Charts, 1919, p. 71.

The recommendations of this committee should be observed when a
single  set  of  data  is  exhibited.  It  may  not  be  advisable  to  carry  out 
the  recommendations  when  two  sets  of  data  are  placed  upon  the  same 
graph  sheet.  We  found  it  possible  to  do  this  on  Chart  5,  but  for  the 
data  in  Table  12,  though  it  is  possible,  it  is  inadvisable. 

One  purpose  of  a  graph  is  to  emphasize  outstanding  facts,  to  make 
evident  outstanding  relationships.  To  accomplish  this,  the  proper 
scales  must  be  selected.  The  selection  of  the  scales  that  will  give  due 
emphasis  to  the  facts  and  relationships  may  not  be  the  scales  such 
that  the  zero  lines  for  both  sets  of  data  can  be  shown  on  the  diagram. 

Consider the data of Table 12. Here we freely omit, without
confusing the figure, the zero lines for both sets of data. This table
gives the quantity of beef available for consumption per capita per
annum, and the wholesale price of steers per 100 pounds for the
given years.

Table 12. Beef: Quantity Available per Capita per Annum
Steers: Price per Hundredweight in Dollars *

Year    Beef Available    Price per Cwt.
        (pounds)          (dollars)

1902        68.5              7.47
1903        76.0              5.57
1904        73.6              5.96
1905        73.0              5.97
1906        72.6              6.13
1907        77.5              6.54
1908        71.5              6.82
1909        75.4              7.34
1910        71.1              7.77
1911        67.7              7.23
1912        61.1              9.36
1913        60.6              8.93
1914        58.5              9.65
1915        54.5              9.31
1916        56.0             10.42

The  graphical  representation  of  these  data  is  found  in  Chart  6. 
During  this  fifteen-year  period  we  note  that  the  quantity  of  beef 
available  per  capita  per  annum  has  generally  decreased,  whereas  the 
wholesale  price  per  hundredweight  has  almost  steadily  increased. 
That is, the quantity available has shown a
downward trend whereas the price has shown an upward trend.
The  trend  for  the  quantity  available  seems  to  be  curvilinear,  while 
that  for  the  price  seems  to  be  linear.  These  trends  and  their  relation- 
ships will  be  further  analyzed  in  Chapter  8. 

* The data are taken from Yearbook of Agriculture, 1928, p. 962; United States
Bureau of Labor Statistics, Bulletin No. 335, p. 38.

18. CUMULATIVE DISTRIBUTIONS AND CURVES

Frequently  the  chief  interest  in  a  frequency  distribution  is  not  so 
much  in  the  items  as  they  are  distributed  in  the  several  classes  as  in 
the accumulated totals of certain of the classes. We may, for example,
be chiefly interested in the number of students who receive
"more than" or "less than" a given mark; in the number of employees
who receive "more than" or "less than" a given wage; in
the number of families who receive "more than" or "less than" a
given income.

We  are  thus  led  to  a  discussion  of  cumulative  distributions  and  to 
their  graphical  representations,  known  as  cumulative  curves.^ 

Consider, for example, the distribution of Table 13, which illustrates
the formation of a "less than" distribution. The column
denoted by Cum. f(X) gives us the number of the given sample who
receive an income less than a given amount, and the column denoted

^  The  cumulative  curve  is  sometimes  called  an  ogive. 
Table 13. Distribution of the Estimated Income among
Unmarried Women of the United States in 1910 ^

Income         Number    Income less than    Cum. f(X)    Cum. f(X)/N
(dollars)      f(X)      (dollars)

    0-  200       10           200                10          0.006
  200-  300       70           300                80          0.044
  300-  400      560           400               640          0.354
  400-  500      530           500             1,170          0.646
  500-  600      280           600             1,450          0.801
  600-  700      150           700             1,600          0.884
  700-  800      110           800             1,710          0.945
  800-  900       37           900             1,747          0.965
  900-1,000       22         1,000             1,769          0.977
1,000-1,100       16         1,100             1,785          0.986
1,100-1,200       12         1,200             1,797          0.993
1,200-1,300        8         1,300             1,805          0.997
1,300-1,400        5         1,400             1,810          1.000

Total          1,810

by Cum. f(X)/N enables us to note the per cent of the total frequency,
N, who receive less than a given amount. Thus, of the 1,810 incomes
considered in the sample, 640 or 35 per cent received less than
$400;  1,600  or  88  per  cent  received  less  than  $700;  1,769  or  98  per 
cent  received  less  than  $1,000. 
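
The two cumulative columns of Table 13 can be formed mechanically from the frequency column. The following is a minimal sketch using the tabled frequencies; the list names are our own.

```python
from itertools import accumulate

# Upper class limits (dollars) and class frequencies from Table 13.
upper_limits = [200, 300, 400, 500, 600, 700, 800, 900,
                1000, 1100, 1200, 1300, 1400]
freqs = [10, 70, 560, 530, 280, 150, 110, 37, 22, 16, 12, 8, 5]

# "Less than" distribution: running total of the frequencies.
cum = list(accumulate(freqs))
N = cum[-1]  # total frequency

# Cum. f(X) / N, rounded to three places as in the table.
cum_fraction = [round(c / N, 3) for c in cum]

for limit, c, p in zip(upper_limits, cum, cum_fraction):
    print(f"less than {limit:>5}: {c:>5}  {p:.3f}")
```

Each entry of `cum` is the number of incomes below the corresponding upper class limit, which is why the "less than" distribution refers to the upper limits of the classes.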

The diagram for the cumulative distribution of Table 13 is constructed
by plotting the points (200, 10), (300, 80), etc., as in elementary
algebra, and joining them by a broken line as in Chart 7. If a
smooth curve be drawn through the points plotted we have a cumulative
curve. The graph of Cum. f(X)/N is precisely coincident with
that of Cum. f(X) if a proper scale is used for the ordinates.

The  cumulative  curve  is  useful  for  the  process  of  interpolation, 
that  is,  for  estimating  values  between  those  given  in  the  table. 
Suppose,  for  example,  we  desire  to  know  the  income  such  that  half 
of  the  1,810  have  incomes  less  than  it  and  half  have  larger  incomes. 
Such an income is called the median² income.

^ W. I. King, Wealth and Income of the People of the United States, 1915, p. 224.
2  The  median  will  be  discussed  in  Chapter  3. 
Chart 7

We have:

$$N = 1{,}810, \qquad \frac{N}{2} = 905$$


Hence  our  question  is:  What  is  the  income  when  the  cumulative 
number  of  incomes  is  905? 

Mark  the  point  905  on  the  vertical  scale.  Draw  through  this 
point a horizontal line which meets the cumulative polygon at A.
Draw  through  A  a  vertical  line  which  meets  the  horizontal  axis  at  B 
for  which  the  income  is  about  $450. 

We  can  check  this  by  simple  proportion.  From  Table  13,  we 
have: 
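Income less than         Cum. f(X)

$400                         640
Median = $400 + x            905
$500                       1,170

In the first column the total difference is 100 and the partial
difference is x; in the second column the total difference is
1,170 - 640 = 530 and the partial difference is 905 - 640 = 265. Hence

$$\frac{x}{100} = \frac{265}{530}$$

$$x = \$50$$

$$\text{Median} = \$400 + x = \$450 \text{ (approx.)}$$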


In  general  the  proportion  is  written: 

$$\frac{\text{Partial difference in 1st column}}{\text{Total difference in 1st column}} = \frac{\text{Partial difference in 2nd column}}{\text{Total difference in 2nd column}}$$
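
The proportion can be applied mechanically to any cumulative frequency that falls between two tabulated values. The following sketch, with the hypothetical helper `interpolate_income`, reproduces the median computed above from the "less than" columns of Table 13.

```python
def interpolate_income(target, limits, cum):
    """Income at which the cumulative frequency reaches `target`,
    found by simple proportion between adjacent tabulated points."""
    for i in range(1, len(cum)):
        if cum[i - 1] <= target <= cum[i]:
            partial = target - cum[i - 1]      # partial difference, 2nd column
            total = cum[i] - cum[i - 1]        # total difference, 2nd column
            width = limits[i] - limits[i - 1]  # total difference, 1st column
            return limits[i - 1] + width * partial / total
    raise ValueError("target lies outside the tabulated range")


# "Income less than" and Cum. f(X) columns of Table 13.
limits = [200, 300, 400, 500, 600, 700, 800, 900,
          1000, 1100, 1200, 1300, 1400]
cum = [10, 80, 640, 1170, 1450, 1600, 1710, 1747,
       1769, 1785, 1797, 1805, 1810]

median = interpolate_income(1810 / 2, limits, cum)
print(median)
```

The same helper may be called with any other cumulative frequency lying between 10 and 1,810; values below the first tabulated point raise an error because the table gives no lower point to interpolate from.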

EXERCISE 

Estimate  from  Chart  7  the  income  such  that  it  is  exceeded  by  exactly 
three-fourths  of  the  1,810  incomes.  Check  your  estimate  by  algebraical 
interpolation. (You should secure about $367 for your result.)

In a manner similar to the formation of the "less than" distribution
we may form a "more than" distribution. While the "less
than"  distribution  proceeds  from  the  least  variates  and  refers  to  the 
upper  limits  of  the  classes,  the  "more  than"  distribution  proceeds 
from  the  greatest  variates  to  the  least  and  refers  to  the  lower  limits 
of  the  classes. 
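
A "more than" distribution for Table 13 may be sketched as follows. We assume here that "more than" a lower class limit counts every income in that class and in all higher classes; the list names are illustrative.

```python
# Lower class limits (dollars) and class frequencies from Table 13.
lower_limits = [0, 200, 300, 400, 500, 600, 700, 800,
                900, 1000, 1100, 1200, 1300]
freqs = [10, 70, 560, 530, 280, 150, 110, 37, 22, 16, 12, 8, 5]

# Accumulate from the greatest variates down to the least.
more_than = []
running = 0
for f in reversed(freqs):
    running += f
    more_than.append(running)
more_than.reverse()

for limit, c in zip(lower_limits, more_than):
    print(f"more than {limit:>5}: {c:>5}")
```

At every class boundary the "less than" and "more than" totals add to the total frequency, 1,810.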

19.  TYPES  OF  FREQUENCY  CURVES 

As the student proceeds with the graphical representation of
frequency  distributions  he  will  be  impressed  with  the  fact  that  the 
graphs  of  data,  even  when  collected  from  widely  different  fields, 
show  certain  common  characteristics,  and  can  therefore  be  described 
as  belonging  to  certain  general  types;  in  fact,  many  of  the  frequency 
curves  can  be  closely  represented  by  equations. 

The  problem  of  representing  the  several  types  of  frequency  dis- 
tributions by  equations  that  will  best  fit  the  data  belongs  to  the 
field  of  advanced  statistics,  and  we  desire  merely  to  suggest  it  at  this 
point.  We  do,  however,  wish  to  describe  briefly  the  general  types  of 
frequency  curves  that  are  most  common. 
By far the most common of all is the moderately asymmetrical or
mound-shaped  distribution.  It  occurs  in  data  collected  from  many 
fields,  such  as  education,  psychology,  sociology,  economics,  biology. 
The  frequencies  of  this  general  type  increase  more  or  less  regularly 
up  to  a  maximum,  and  then  decrease  in  the  same  way.  We  also 
note  in  this  type  a  piling  up  of  cases  near  the  center,  that  is,  a  central 
tendency. On Chart 8 we illustrate this type by curves a1 and a2.

Chart  8 


Types  of  Frequency  Curves 


The second type we may name the symmetrical distribution, in
which  the  frequencies  decrease  uniformly  on  either  side  of  a  line 
through  the  center.  It  is  frequently  approached  in  form  by  data 
derived  from  coin-  and  dice-throwing  experiments;  from  errors  of 
observation  in  physical  measurements;  from  biological  measure- 
ments; from  educational  measurements.  The  graph  for  this  second 
type  is  illustrated  by  curve  b  on  Chart  8. 

Many writers call this second type a normal or bell-shaped distribution.
We much prefer to reserve the name normal for the special
symmetrical distribution whose equation is

$$y = Ce^{-h^2x^2}$$

and  which  we  shall  discuss  rather  fully  in  Chapter  12.  There  are 
many  equations  for  the  curve  of  the  general  form  we  are  describing 
here  (see  Exercises  6,  7,  8,  and  9  at  the  end  of  this  chapter)  but  a 
curve  is  normal  only  when  its  equation  has  the  above  form. 

The third type is the J-shaped distribution in which the frequency
constantly increases or constantly decreases. The greatest frequency
is at one end of the distribution or the other. It is illustrated by
curve c on Chart 8. [See Exercise 4, page 43.]

The fourth type we shall mention is the U-shaped distribution
(see curve d, Chart 8). It is rarely met. The stock example, familiar
to statisticians, is given in Exercise 11 at the end of this chapter.

20. SUGGESTIONS FOR TABULAR AND GRAPHIC PRESENTATION

The  process  of  arranging  data  into  columns  and  rows  in  an  orderly 
manner  is  called  tabulation.  The  essential  characteristics  of  a  good 
tabulation are clearness and compactness. While no hard and fast
rules  can  be  given  to  cover  all  cases  of  table  construction,  the  follow- 
ing suggestions  may  be  found  helpful:  ^ 

1.  The  table  should  have  a  clear  and  concise  title. 

2. The columns and the rows should be arranged in an order that will
facilitate  comparisons. 

3.  The  columns  should  have  concise  headings  stating  the  units  of 
measurement  when  necessary. 

4. The forms should be set off by double lines at the top and the
bottom,  the  sides  remaining  open.^ 

5.  The  totals  may  be  placed  above  or  below  the  detail  which  they 
summate. 

6.  If  possible,  the  source  of  the  data  should  be  given. 

In the construction of the charts, we should note especially that:

1.  A  boundary  (picture  frame)  improves  the  appearance  of  the  picture. 

2.  A  clear  title  and  subtitle  should  be  in  evidence. 

^  A  splendid  treatment  of  tabular  representation  is  found  in  Horace  Secrist, 
An  Introduction  to  Statistical  Methods,  rev.  ed.,  1925,  Chap.  VI. 
2  A  narrow,  compact  table  may  have  side  lines. 
3. The scales should be so selected that the main facts are given due
emphasis. 

4.  The  horizontal  and  vertical  scales,  with  suitable  captions,  should 
be  easily  interpreted. 

EXERCISES 

1.  The  following  table  gives  the  distributions  of  heights  and  weights  of 
1,515  first-year  university  men.  What  are  the  class  boundaries  of  the 
several  classes  in  each  distribution?  Construct  the  histograms  and  fre- 
quency curves  for  these  distributions.  To  what  general  types  do  these 
distributions  belong? 


Distribution of Heights and Weights of 1,515 Men

(a) Heights in Inches

Class Mark X    Frequency f(X)

     58                2
     59                1
     60                7
     61               10
     62               26
     63               40
     64               74
     65              142
     66              220
     67              230
     68              258
     69              231
     70              118
     71               99
     72               38
     73               15
     74                2
     75                1
     76                0
     77                1

Total              1,515

(b) Weights in Pounds

Class Mark X    Frequency f(X)

    95.5               5
   105.5              34
   115.5             139
   125.5             300
   135.5             367
   145.5             319
   155.5             205
   165.5              76
   175.5              43
   185.5              16
   195.5               3
   205.5               4
   215.5               3
   225.5               1

Total              1,515

2. The following table gives the distribution of head-breadths of 1,000
Cambridge men, the measurements being taken to the nearest tenth of an
inch. Draw the histogram and the frequency polygon for these data.
What is the general type of this distribution? Find the class boundaries.

Distribution of Head-Breadths of 1,000 Men ^

Class Mark X    Frequency f(X)

    5.5                3
    5.6               12
    5.7               43
    5.8               80
    5.9              131
    6.0              236
    6.2              142
    6.3               99
    6.4               37
    6.5               15
    6.6               12
    6.7                3
    6.8                2

Total              1,000

3.  In  the  following  table,  the  average  price  per  bushel  is  that  received 
by  producers  December  1. 

Average  Yield  and  Average  Price  of  Wheat,  1919-1928  2 


Year    Average Yield per Acre    Average Price per Bushel
        (bushels)                 (cents)

1919          12.8                     214.9
1920          13.6                     143.7
1921          12.8                      92.6
1922          13.9                     100.7
1923          13.4                      92.3
1924          16.5                     129.9
1925          12.9                     141.6
1926          14.8                     119.8
1927          14.9                     111.5
1928          15.6                      97.2

Make a chart of these data representing both the average yield and the
average  price  on  the  same  diagram.    What  can  you  say  about  the  trends? 

*  The  data  are  taken  from  Biometrika,  Vol.  I,  p.  220. 

*  The  data  are  taken  from  Yearbook  of  Agriculture,  1928,  p.  670. 
4. In the following table the average price per barrel is that of Baldwins
at Boston.

Total Production of Apples in the United States and Average
Price, 1910-1926 ^

Year    Production               Price per Barrel
        (millions of bushels)    (dollars)

1910          142                     3.68
1911          214                     2.56
1912          235                     2.28
1913          145                     3.95
1914          253                     2.08
1915          230                     2.36
1916          194                     3.44
1917          167                     4.40
1918          170                     5.94
1919          142                     6.71
1920          224                     4.02
1921           99                     6.69
1922          203                     4.84
1923          203                     4.02
1924          172                     4.78
1925          172                     3.92
1926          247                     3.22

Make  a  chart  of  these  data  representing  both  the  production  and  the 
price  on  the  same  diagram.  Point  out  an  important  relationship  that  is 
emphasized  by  the  graph. 

5. Make a chart of the data of the following table. Describe the trend.

Divorces in the United States²

Year    Number of Divorces
        (thousands)

1890           33.5
1895           40.4
1900           55.8
1905           68.0
1916          112.0
1922          148.8
1926          180.0

Plot  the  curve  for  each  of  the  following  equations,  and  describe  the 
general  type  to  which  each  curve  belongs. 


6.  y 


100 


2^  +  2-^ 


7.  y 


100 


8.  y  =  50  (2)-T 


9.  2/  =  50  1  - 


r6> 


10.  ,  =  5o(i+5y(i-0 


^ Loc. cit., p. 764.

2 The data are taken from Statistical Abstract of the United States, 1930,
p. 92.
11. Draw a histogram to represent the data of the following table.

Frequencies in Days of Estimated Intensities
of Cloudiness at Breslau, 1876-1885

Cloudiness    Frequency

     0            751
     1            179
     2            107
     3             69
     4             46
     5              9
     6             21
     7             71
     8            194
     9            117
    10          2,089

Total           3,653

12. The following table gives the annual production of Portland Cement
in the United States (Statistical Abstract of the United States, 1930, p. 785).
Construct a broken-line diagram for these data. Is the general trend
upward or downward? linear or curvilinear?

Portland Cement Production

Year    Production
        (millions of barrels)

1910           77
1911           79
1912           82
1913           92
1914           88
1915           86
1916           92
1917           93
1918           71
1919           81
1920          100
1921           99
1922          115
1923          137
1924          149
1925          161
1926          165
1927          173
1928          176
1929          171

13. The following table gives the annual production of cigarettes in
the United States. Construct a broken-line diagram for these data. Is
the trend of production linear or curvilinear?

Cigarette Production

Year    Annual Production
        (billions)

1920           47.4
1921           52.1
1922           55.8
1923           66.7
1924           72.7
1925           82.2
1926           92.1
1927           99.8
1928          108.7
1929          122.3

CUMULATIVE  REVIEW 

1.  Name  several  fields  of  investigation  that  make  use  of  the  statistical 
method. 

2.  Name  the  four  steps  in  the  solution  of  a  statistical  problem  and 
state  briefly  what  each  means. 

3.  What  is  meant  by  variation  in  statistical  data? 

4.  Define  continuous  variates;  discrete  variates.  Illustrate. 

5. What is meant by "error in a measurement"? The relative error
in  a  measurement?  Illustrate. 

6.  Usually  what  is  the  object  of  a  statistical  analysis? 

7.  Distinguish  between  sample  and  parent  population. 

8.  Can  you  think  of  a  problem  in  which  the  primary  object  is  a  sum- 
marized numerical  description  of  the  sample  only? 

9. What letter do we use to designate the total frequency, Σf(X)?

10.  In  the  terminology  of  the  text,  what  do  the  symbols  X,  f(X),  and 
N  represent? 

11.  Give  directions  for  constructing  a  histogram;  a  frequency  polygon. 

12. Prove: The total area of a histogram equals the total frequency, N,
times the class width, w. That is,

$$\text{Area} = \sum w\,f(X) = wN.$$

13.  What  is  an  ogive?    Mention  several  uses  of  the  ogive. 
MEASURES OF CENTRAL TENDENCY

21.  INTRODUCTION 

It  will  be  recalled  that  after  the  collection  of  the  data  the  next  step 
in  the  solution  of  a  statistical  problem  is  the  organization  of  the  data. 
The  preceding  chapter  has  been  devoted  to  the  problem  of  organi- 
zation of  the  data  and  its  graphical  analysis.  This  brings  us  to  the 
third  step,  the  numerical  analysis  of  the  data.  We  shall  find  it  neces- 
sary to  devote  several  chapters  to  this  important  part  of  our  problem. 

The  primary  purpose  (see  Section  1)  of  a  statistical  analysis  is  to 
abstract  the  relevant  information  from  a  mass  of  numerical  data  and  to 
express  the  results  clearly  and  concisely.  We  accomplish  this  purpose 
by  computing  certain  summarizing  numbers,  or  averages,  which  are 
simply statistical constants, rigidly defined, and which are designed,
as Professor Bowley says, "to enable the human mind to comprehend
with  a  single  effort  the  significance  of  the  whole." 

Averages  may  be  used  not  only  to  give  us  a  concise  picture  of  a 
large  group  of  numbers,  but  they  may  be  used  also  to  compare  dif- 
ferent groups,  to  obtain  important  facts  about  a  large  universe  (the 
parent  population)  from  the  measurements  of  a  sample,  to  measure 
the  relationship  between  different  groups. 

The  present  chapter  will  be  devoted  to  the  averages  which  measure 
central  tendency.^  We  shall  give  attention  to  five  such  measures: 
(1)  the  arithmetic  mean,  (2)  the  median,  (3)  the  mode,  (4)  the 
geometric  mean,  and  (5)  the  harmonic  mean. 

While  we  shall  not  at  this  time  undertake  to  judge  the  relative 
merits of these measures, we may with propriety mention several
criteria  by  which  an  average  may  be  fairly  judged.  Yule  has  men- 
tioned several  properties  that  an  average  should  possess.^  He  says 
that  an  average  (1)  should  be  rigidly  defined,  (2)  should  be  based  on 
all  the  observations,  (3)  should  be  readily  comprehensible,  (4)  should 
be  easily  computed,  (5)  should  be  affected  as  little  as  possible  by  the 

1 Averages that measure other characteristics will be discussed in succeeding
chapters.
2 Yule and Kendall, op. cit., p. 113.

fluctuations due to sampling,^ and, finally, (6) should lend itself readily
to algebraic treatment.

22.  THE  ARITHMETIC  MEAN,  Mx 

The  arithmetic  mean  of  a  group  of  numbers,  essentially  measure- 
ments, is  their  sum  divided  by  their  number.  For  example,  the 
arithmetic  mean  of  the  numbers  3,  5,  8,  13,  6  is  given  by: 

$$M_X = \frac{3 + 5 + 8 + 13 + 6}{5} = \frac{35}{5} = 7$$

In algebraic form, if $X_1, X_2, X_3, \ldots, X_N$ is a set of N variates,
their arithmetic mean is given by:

$$M_X = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N} \qquad (1)$$
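
The definition translates directly into a computation. The following is a minimal sketch; the function name `arithmetic_mean` is our own.

```python
def arithmetic_mean(values):
    """Sum of the variates divided by their number."""
    if not values:
        raise ValueError("the mean of an empty set of variates is undefined")
    return sum(values) / len(values)


# The five numbers used in the example of this section.
print(arithmetic_mean([3, 5, 8, 13, 6]))
```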

It  may  happen  that  many  of  the  variates  may  be  equal.  Suppose 
the  grades  of  23  students  on  a  certain  test  were:  96,  92,  92,  85,  85, 
85,  85,  76,  76,  76,  76,  76,  76,  65,  65,  65,  65,  60,  60,  60,  50,  50,  40.  Of 
course  we  can  find  the  arithmetic  mean  by  the  above  formula  and 
definition,  but  the  arithmetic  is  simplified  if  we  proceed  as  follows: 

$$M_X = \frac{96(1) + 92(2) + 85(4) + 76(6) + 65(4) + 60(3) + 50(2) + 40(1)}{23} = \frac{1656}{23} = 72 \text{ centigrade units (c.u.)}$$

The  numbers,  1,  2,  4,  6,  4,  3,  2,  1  are  the  frequencies  of  the  grades. 
We  can  show  this  arithmetic  mean  by  simply  arranging  the  above  in 
tabular  form,  thus  giving  the  frequency  distribution. 

Table  14.  Frequencies  of  Grades  of  23  Students 


Grade X    Frequency f(X)    Xf(X)

  96             1              96
  92             2             184
  85             4             340
  76             6             456
  65             4             260
  60             3             180
  50             2             100
  40             1              40

Total           23           1,656

^  See  Section  37  for  an  explanation. 
$$M_X = \frac{1656}{23} = 72 \text{ c.u.}$$

In general, suppose that $X_1$ appears $f(X_1)$ times, that $X_2$ appears
$f(X_2)$ times, and so on, and that $X_n$ appears $f(X_n)$ times, then
evidently:

$$M_X = \frac{\sum_{i=1}^{n} X_i f(X_i)}{\sum_{i=1}^{n} f(X_i)} = \frac{\sum X f(X)}{N} \qquad (2)$$

where $N = \sum_{i=1}^{n} f(X_i)$ = the total frequency = the number of the
measures. The table headings for formula (2) should be

X        f(X)        Xf(X)

as is illustrated by Table 14.
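
Formula (2) can be checked against Table 14 with a few lines of code. This is only a sketch; the dictionary `grades`, mapping each grade X to its frequency f(X), is an illustrative choice of representation.

```python
# Table 14: grade X mapped to its frequency f(X).
grades = {96: 1, 92: 2, 85: 4, 76: 6, 65: 4, 60: 3, 50: 2, 40: 1}

N = sum(grades.values())                        # total frequency
sum_xf = sum(x * f for x, f in grades.items())  # total of the Xf(X) column
M = sum_xf / N                                  # formula (2)

# Same result from formula (1) applied to the 23 individual grades.
individual = [x for x, f in grades.items() for _ in range(f)]
M_direct = sum(individual) / len(individual)

print(N, sum_xf, M, M_direct)
```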

At  this  point  a  few  words  about  our  notation  are  in  order.  We 
have  indicated  our  measurements,  scores,  etc.  by  the  upper  case  X. 
Since  the  arithmetic  mean  is  generally  called  the  mean,  we  may 
naturally  represent  the  mean  by  M.  The  subscript  X  gives  emphasis 
to  the  fact  that  we  are  averaging  values  of  X.  If  the  original 
items  are  indicated  by  Y,  or  by  Z,  the  corresponding  means  may  be 
represented  by  My  and  Mz  respectively.  We  shall  find  it  necessary 
to  use  the  subscript  only  when  dealing  with  problems  of  theory  or 
when  we  wish  to  emphasize  what  variable  we  are  averaging.  Hence, 
in general we shall indicate the mean by M without the subscript.

In the preceding chapter, when discussing the formation of frequency
distributions, our attention was directed to two important
assumptions that we must make regarding our data. We assume:

1. That in any class the measures are uniformly distributed throughout
the interval;

2. That the frequency of the class may be concentrated at its mid-point.

We shall see that in most cases the error due to grouping is relatively
slight and that even this can frequently be adjusted by certain
corrections.^


^ Sheppard's Corrections, Section 43 of this volume.

It is evident, as indicated above, that the use of formula (2) for
computing M requires columns for X the class mark, for f(X) the
frequency, and for Xf(X). The column for the class intervals may
or may not be included at the pleasure of the computer.

As another illustrative example, we shall compute M for the
distribution of grades in college algebra previously exhibited in
Table 8. (Note the application of assumption 2 above.)

Table 15. Grades in College Algebra: Computing M

  X     f(X)    Xf(X)
 95       4       380
 90       6       540
 85      12     1,020
 80      19     1,520
 75      37     2,775
 70      24     1,680
 65      11       715
 60       6       360
 55       4       220
 50       2       100
Total   125     9,310

\[ M = \frac{9310}{125} = 74.48 \text{ c.u.} \]


The sum of the original grades in Table 6 is 9,313, thus giving the
arithmetic mean from the ungrouped data to be 74.504. In either
case if the values are rounded to one decimal place we have:

M = 74.5 c.u.

The extreme closeness of the two results is accounted for by our
choosing as mid-points of the class intervals the values 60, 65, 70, 75,
etc., at which the original data were heavily loaded.
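
The computation of Table 15 by formula (2) can be checked mechanically. The following is a minimal sketch in Python; the function name grouped_mean is an illustrative assumption and not part of the text, while the class marks and frequencies are copied from Table 15.

```python
# Grouped-data mean by formula (2): M = sum(X * f(X)) / sum(f(X)).
# Class marks and frequencies are those of Table 15.
class_marks = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
frequencies = [4, 6, 12, 19, 37, 24, 11, 6, 4, 2]


def grouped_mean(marks, freqs):
    """Return (N, sum of X*f(X), M) for a frequency distribution."""
    total_frequency = sum(freqs)
    weighted_sum = sum(x * f for x, f in zip(marks, freqs))
    return total_frequency, weighted_sum, weighted_sum / total_frequency


N, sum_xf, M = grouped_mean(class_marks, frequencies)
print(N, sum_xf, round(M, 2))  # 125 9310 74.48
```

The three printed values correspond to the totals row of Table 15 and to the quotient 9,310/125.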

23.  THE  ARITHMETIC  MEAN  AS  A  MOMENT 

The term moment is one which the statisticians have borrowed
from the subject of mechanics, where "the moment of a force is its
tendency to produce rotation." Thus if we have a weight of 10 pounds
suspended from a horizontal bar at a point 5 feet from the fulcrum O,

Diagram 2. [A 10-lb. weight suspended 5 ft. from the fulcrum O.]

the first moment^1 of the force about O is 10 × 5 = 50 foot pounds,
which is the tendency of the force to produce clockwise rotation
about O. If we have two weights of 25 pounds and 10 pounds suspended
at distances of 4 feet and 5 feet respectively from and on the
same side of O, the total first moment of the two forces about O is
25 × 4 + 10 × 5 = 150 foot pounds, which is the tendency of the
two forces to produce clockwise rotation about O.

Diagram 3. [Weights of 25 lbs. and 10 lbs. suspended 4 ft. and 5 ft., respectively, from O.]

It is evident that a single force of 50 pounds suspended 3 feet from
O on the same side would produce the same turning effect.

Diagram 4. [A 50-lb. weight suspended 3 ft. from O.]

But where could we suspend both the 10-pound and the 25-pound weights
(or a single 35-pound weight) in order that they would produce the
same first moment as the 25-pound and the 10-pound weights when
located as above?

Diagram 5. [A 35-lb. weight suspended x ft. from O.]

We evidently have the equation:

\[ 35x = 150 \]

from which

\[ x = \frac{150}{35} = 4\tfrac{2}{7} \text{ ft.} \]

1 If the weight be multiplied by the square of the distance, the product is called
the second moment of the force about O.
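
The lever arithmetic above can be reproduced in a few lines. This is only an illustrative sketch; the variable names are assumptions introduced here, and exact fractions are used so that the quotient 150/35 is not rounded.

```python
from fractions import Fraction

# Weights (lbs.) and their distances (ft.) from the fulcrum O, as in Diagram 3.
weights_lb = [25, 10]
distances_ft = [4, 5]

# Total first moment about O: the sum of weight times distance.
total_moment = sum(w * d for w, d in zip(weights_lb, distances_ft))

# Distance at which the combined 35-lb. weight gives the same first moment.
total_weight = sum(weights_lb)
x = Fraction(total_moment, total_weight)

print(total_moment, total_weight, x, float(x))
```

The fraction reduces to 30/7, that is, 4 2/7 feet.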


Let us now consider the frequencies as weights or forces acting at
the distances from O determined by their class marks as the figure
indicates. The total clockwise turning effect (the first moment) of
all the frequencies is:

\[ X_1 f(X_1) + X_2 f(X_2) + \cdots + X_n f(X_n) = \sum X f(X) \]

That is, the total first moment of the several frequencies is $\sum X f(X)$.

Diagram 6. [The frequencies f(X_1), f(X_2), f(X_3), . . ., f(X_n) suspended as weights at the distances X_1, X_2, X_3, . . ., X_n from O.]

Now where can the sum of the frequencies, $\sum f(X)$ or N, be suspended
in order that it may produce the same turning effect?

Diagram 7. [The total frequency suspended at the distance M from O.]

Evidently at the point M, since by (2):

\[ M \cdot N = \sum X f(X) \]

Hence M is that point in the X scale at which the total frequency
may be suspended so that the first moment of the total frequency about O
equals the total first moment about O of the several frequencies.

Let us look at this matter now from a statistical rather than from
a statical point of view. The preceding discussion has really been
concerned with statical moments. By Theorem II of Section 4
(p. 9), we can write:

\[ M = \frac{\sum X f(X)}{N} = \sum X \cdot \frac{f(X)}{N} \]

If we suspend the quantities $\frac{f(X_i)}{N}$, i = 1, 2, 3, . . ., n, at the
points designated by the class marks, it is evident that M is the
tendency of the several frequencies, when each is divided by N, to
produce rotation about O; that is, M is the first statistical moment
about O.

Diagram 8. [The quantities f(X_1)/N, f(X_2)/N, . . ., f(X_n)/N suspended at the distances X_1, X_2, . . ., X_n from O.]

The nth statistical moment of a frequency distribution about any
point A is defined as:

\[ \frac{1}{N} \sum_i d_i^{\,n} f(X_i) \]

where $d_i$ is the distance from A to $X_i$ and $f(X_i)$ is the frequency
corresponding to $X_i$.

Another simple but important moment property of the arithmetic
mean is contained in the theorem: the first moment of a distribution
about the arithmetic mean is zero.

We shall indicate the deviation of any measure from the arithmetic
mean by x. That is, $x_i = X_i - M$. Applying Theorems I and II
of Section 4, and formula (2) we have

\[ \text{First moment about } M = \frac{1}{N}\sum x f(X) = \frac{1}{N}\sum (X - M) f(X) \]
\[ = \frac{1}{N}\left[\sum X f(X) - M \sum f(X)\right] \]
\[ = \frac{1}{N}\left[MN - MN\right] = 0 \]

Of course the corollary immediately follows that

\[ \sum x f(X) = 0 \]

The arithmetic mean is then the "center of balance" or the "center
of equilibrium" of the frequencies. That is, it is the point about
which the frequencies suspended as weights will balance or be in
equilibrium.

Let us examine the formula $\sum x f(X) = 0$ for its algebraic meaning
to statistics. Each x is a deviation of a corresponding X from the
arithmetic mean: that is, $x_i = X_i - M$. Since each x occurs f(X)
times, the quantity $\sum x f(X)$ gives the algebraic sum of the deviations
from the arithmetic mean of measures grouped in a frequency
distribution. Thus we have the theorem: if N measures are arranged in
a frequency distribution, the algebraic sum of the deviations from M is
zero. [See Exercise 5 of the next list.]

To  illustrate  these  important  properties,  consider  the  following 
distribution: 


Table 16

  X      f(X)    Xf(X)    x = X - 70    xf(X)
 82.5      1      82.5       12.5        12.5
 77.5      3     232.5        7.5        22.5
 72.5      8     580.0        2.5        20.0
 67.5     10     675.0       -2.5       -25.0
 62.5      4     250.0       -7.5       -30.0
Total     26    1820.0                    0.0

By formula (2) we find

\[ M = \frac{1820}{26} = 70 \]

The  second  part  of  the  table  follows  immediately. 
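
The theorem can be verified numerically on the distribution of Table 16. The sketch below is illustrative only; the variable names are assumptions, and the data are those of the table.

```python
# Class marks and frequencies of Table 16.
marks = [82.5, 77.5, 72.5, 67.5, 62.5]
freqs = [1, 3, 8, 10, 4]

N = sum(freqs)
M = sum(x * f for x, f in zip(marks, freqs)) / N  # formula (2)

# Deviations x = X - M, each weighted by its frequency.
weighted_deviations = [(x - M) * f for x, f in zip(marks, freqs)]

clockwise = sum(d for d in weighted_deviations if d > 0)
counter_clockwise = sum(d for d in weighted_deviations if d < 0)
first_moment_about_mean = sum(weighted_deviations) / N

print(N, M, clockwise, counter_clockwise, first_moment_about_mean)
# 26 70.0 55.0 -55.0 0.0
```

The positive and negative weighted deviations are equal in size, which is the balancing property described in the text.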

We may see what this theorem means graphically by considering
the following diagram. We see that the counter-clockwise moment
(turning effect) about M (= 70) is balanced by the clockwise moment
about this point, for the clockwise moment equals +55 and
the counter-clockwise moment equals -55. Thus, the point 70 is
the center of equilibrium.



Diagram 9. [The X scale from 60 to 85, with the frequencies 4, 10, 8, 3, 1 suspended at the deviations -7.5, -2.5, 2.5, 7.5, 12.5 from M = 70.]

These moment considerations have led some authorities to call
the arithmetic mean for grouped data, as defined by Formula (2),
the weighted arithmetic mean. In contradistinction the arithmetic
mean for ungrouped data, as defined by Formula (1), they call the
unweighted arithmetic mean.

Note.  The  following  list  of  exercises  is  given  primarily  to  prepare  the 
student  for  a  facile  reading  of  Section  24.  The  several  exercises  should 
be  solved  in  detail. 


EXERCISES 


1. Draw a moment-diagram (see Diagram 8, page 65)
for the data of the adjacent distribution. Compute
the first statistical moment about O of these data.

  X      f(X)
  2.5     2
  5.0     4
  7.5     8
 10.0     4
 12.5     2

2. Compute M for the distribution of lengths of beans described in
Exercise 1 on page 41. In what unit is M measured?

3.  Compute  M  for  the  distribution  described  in  Exercise  3  on  page  42. 
In  what  unit  is  M  measured? 

4. Complete the following:

X_1 = 1     X_1 - M = 1 - (4) = (-3) = x_1
X_2 = 2     X_2 - M = 2 - ( ) = (  ) = x_2
X_3 = 4     X_3 - M =           (  ) = x_3
X_4 = 5     X_4 - M =           (  ) = x_4
X_5 = 8     X_5 - M = 8 - ( ) = (  ) = x_5

ΣX = ( )    ΣX - 5M = 0 = Σx

M = (4)

5. In our notation $x_i$ designates the deviation of $X_i$ from M, i.e.,
$x_i = X_i - M$. Completing the adjacent table we
arrive at the theorem: the algebraical sum of the
deviations of a group of numbers from their arithmetic
mean is zero.

X_1 - M = x_1
X_2 - M = (  )
. . .
X_n - M = (  )

ΣX - NM = (  )
ΣX - N( ) = (  )
0 = Σx

6. We may frequently save labor in statistical computations by referring
the numbers to some new origin. Consider the numbers X: 20, 25, 28,
30, 32. Referred to X = 25 as origin these numbers become U: -5, 0,
3, 5, 7.

[Diagram: the X scale marked at 20, 25, 28, 30, 32, showing M_X; beneath it the U scale marked at -5, 0, 3, 5, 7, showing M_U.]

\[ M_U = \frac{\sum U}{N} = \frac{-5 + 0 + 3 + 5 + 7}{5} = \frac{10}{5} = 2 \]

on the U scale, which corresponds to 27 on the X scale. That is, $M_X = 27$.

7. The U and X in Number 6 are evidently connected by the relation
U = X - 25, or X = U + 25. Replace X by this value in formula (1),
page 60, and show that

\[ M_X = \frac{\sum X}{N} = 25 + \frac{\sum U}{N} \]

8. Complete the table and find $M_X$ from the formula
derived in Number 7.

  X     U = X - 25
 20        (  )
 25        (  )
 28        (  )
 30        (  )
 32        (  )
           (  ) = ΣU


9. Find $M_X$ of the numbers 315, 330, 345, 360, 375, 395, 400, by selecting
the new origin at X = 350. Proceed as follows:

First: Derive the formula for $M_X$ using U = X - 350 and N = 7.
Second: Prepare the table and substitute in the derived formula.

10.  Find  Mx  of  the  numbers  228,  232,  234,  236,  238,  240,  243,  247, 
by  selecting  the  new  origin  at  X  =  240. 

11. Find $M_X$ of the numbers 215, 230, 245, 260, 275, 295, 300, by selecting
the new origin at X = 250.

12. Find $M_X$ of the numbers 523, 534, 536, 538, 540, 543, 547, by selecting
the new origin at X = 540.

13. Can you think of two simple ways to find $M_X$ for the numbers 75,
150, 225, 375? Explain them.

Hint: Let $U = \frac{X}{75}$ or X = 75U, and proceed.

14.  Find  Mx  of  the  numbers  128,  256,  384,  512,  640,  768.  Note  that 
each  number  is  divisible  by  128. 

15. Prove that if U = X - A or X = U + A, A being a constant,
then, using (1),

\[ M_X = A + \frac{\sum U}{N} = A + M_U \]


16.

Class          X    f(X)    x' = (X - 20)/5    x'f(X)
 2.5- 7.5      5      2
 7.5-12.5     10      6
12.5-17.5     15     11
17.5-22.5     20     16
22.5-27.5     25     10
27.5-32.5     30      6
32.5-37.5     35      3
Totals               54

(1) Complete columns 4 and 5, and find $\sum x' f(X)$.

(2) Note that $x' = \frac{X - 20}{5}$ or X = 5x' + 20. Replace X by this
value in formula (2), page 61, and show that

\[ M_X = 20 + \frac{5 \sum x' f(X)}{54} \]

(3) Substitute and find $M_X$.

17. Prove that if $U = \frac{X}{k}$, k being a constant, then, using (1),

\[ M_X = \frac{k \sum U}{N} = k M_U \]

18. If X = U + h or U = X - h,
h being a constant, show from equation (2) that:

\[ M_X = h + \frac{\sum U f(X)}{N} \]

This transformation is equivalent to moving the origin to the point (h, 0),
the unit of measure remaining the same.

19. Let h = 75, and use the conclusion in the preceding exercise to find
$M_X$ for the data in Table 15. The tabular diagram should be as follows:

M_X for Table 15 when h = 75

  X     f(X)    U = X - 75    Uf(X)
 95       4         20          80
 90       6         15          90
etc.    etc.       etc.        etc.
Total

20. Let $x' = \frac{X}{w}$ or wx' = X,
w being a constant, and show, using (2), that

\[ M_X = \frac{w \sum x' f(X)}{N} \]

This transformation is equivalent to expressing the variates in class units,
the origin remaining the same.

21. Let w = 5, and use the conclusion in the preceding exercise to find
$M_X$ for the data in Table 15. The tabular diagram should be as follows:

M_X for Table 15 when w = 5

  X     f(X)    x' = X/5    x'f(X)
 95       4        19         76
 90       6        18        108
etc.    etc.      etc.       etc.
Total

22.  Using  h  =  54,  and  the  results  of  Exercise  18  above,  find  Mx  for  the 
distribution  in  Exercise  3,  page  42. 

24.  A  SHORT  METHOD  FOR  COMPUTING 
THE  ARITHMETIC  MEAN 

It frequently happens that the distribution under consideration has
large values for X, large values for f(X), or large values for both
X and f(X), and the consequent arithmetical work for computing
$M_X$ and other statistical constants becomes very tedious. In such
cases it is convenient, sometimes necessary, to simplify the numbers
so that we can save much labor. Three possible steps may be taken.
We may:

1.  Change  the  unit  of  measure  as  in  Exercise  20,  page  70. 

2.  Express  the  variates  as  measures  from  some  new  origin  (frequently 
called  the  provisional  mean  or  the  guessed  mean)  as  in  Exercise  19, 
page  70. 

3.  Combine  1  and  2  to  change  the  unit  of  measure  and  express  the 
variates  as  measures  from  some  new  origin  as  in  Exercise  16,  page  69. 

We shall derive the appropriate formula for the third possibility,
and show that the others are special cases of it. To do this let: ^

\[ x' = \frac{X - h}{w} \quad \text{or} \quad X = wx' + h \]

where:

h = the distance in original units from O to the new origin O'
w = the class width
x' = the deviation of X from h expressed in class units

Then applying Theorems I and II of Section 4 (p. 9), we have:

\[ M = \frac{\sum (wx' + h) f(X)}{N} = \frac{\sum wx' f(X)}{N} + \frac{\sum h f(X)}{N} = \frac{w \sum x' f(X)}{N} + \frac{h \sum f(X)}{N} \]

\[ M = h + \frac{w \sum x' f(X)}{N} \tag{3} \]

since $\sum f(X) = N$.

^ See Figure 1, page 73.

The quantity $\frac{\sum x' f(X)}{N}$ is usually denoted in statistical work by $b_x$
or by $\nu'_1$ (read: nu one prime), and is called the first moment about
the arbitrary origin (h, 0), expressed in class units. Hence we have:

\[ M = h + w b_x = h + w \nu'_1 \tag{3a} \]

where

\[ b_x = \nu'_1 = \frac{\sum x' f(X)}{N} \]

If h = 0, we get the results previously mentioned in Exercise 20
on page 70, and if w = 1, we get the results equivalent to those
mentioned in Exercise 18 on page 70. We shall refer to the computation
of M by (3) as "the short method of computing M."

Figure  1  on  page  73  gives  the  graphical  representation  of  this 
development.  We  shall  refer  to  this  figure  many  times,  hence  it 
should  be  well  mastered. 

We have here the frequency curve Y = f(X) referred to the axes
OX and OY. The point P whose coordinates are (X, f(X)) is any
point on the curve.

O' = (h, 0) is the arbitrary origin or guessed mean. It should be
chosen at a class mark near the center of the distribution.

$w b_x$ = the distance from O' to M.

Evidently:

\[ X = h + wx' \]
\[ OM = M = h + w b_x \]

The distance MR, which is the deviation of any measure from the
mean, will be needed in the next chapter. It is represented by
small x.

Figure 1. [The frequency curve Y = f(X), showing the deviation x' in class units and the corresponding distance in original units.]


Let us apply formula (3) to compute M for the distribution of
Table 15.

Let h = 75 and w = 5. We then have:

\[ X = 5x' + 75 \quad \text{or} \quad x' = \frac{X - 75}{5} \]

and Table 15 becomes:

Table 17. M for Table 15 when h = 75 and w = 5

  X     f(X)    x' = (X - 75)/5    x'f(X)
 95       4           4              16
 90       6           3              18
 85      12           2              24
 80      19           1              19
 75      37           0               0
 70      24          -1             -24
 65      11          -2             -22
 60       6          -3             -18
 55       4          -4             -16
 50       2          -5             -10
Total   125                         -13

\[ b_x = \frac{\sum x' f(X)}{N} = \frac{-13}{125} \]

\[ M = h + w b_x = 75 + 5\left(\frac{-13}{125}\right) = 74.48 \]
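
The short method of formula (3) lends itself to a direct check against Table 17. The following is a minimal sketch; the function name short_method_mean is an assumption made for illustration, and the data are those of Table 15 with h = 75 and w = 5.

```python
# Short method, formula (3): M = h + w * sum(x' * f(X)) / N,
# where x' = (X - h) / w is the deviation from h in class units.
class_marks = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
frequencies = [4, 6, 12, 19, 37, 24, 11, 6, 4, 2]


def short_method_mean(marks, freqs, h, w):
    """Return (sum of x'*f(X), M) using guessed mean h and class width w."""
    n_total = sum(freqs)
    steps = [(x - h) / w for x in marks]          # x' for each class
    sum_steps = sum(s * f for s, f in zip(steps, freqs))
    return sum_steps, h + w * sum_steps / n_total


sum_xprime_f, M_short = short_method_mean(class_marks, frequencies, h=75, w=5)
print(sum_xprime_f, round(M_short, 2))  # -13.0 74.48
```

The result agrees with the value obtained from formula (2), as it must, since (3) is only a change of origin and unit.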


EXERCISES 

1. Using the short method, compute M for the distribution (a) of
heights, Exercise 1, page 54.

2. Compute M of weights, Exercise 1 (b), page 54, by the short
method.

3. Compute M for the distribution of head-breadths, Exercise 2,
page 54, by the short method.

4. a. Show that M for the first N integers, 1, 2, 3, . . . , N is (N + 1)/2.
b. Show that M for the first N odd integers, 1, 3, . . . , (2N - 1) is N.

5. The salaries of 100 male employees of the Smith-Jones Machine
Company were arranged into two groups of 40 and 60 men with mean
weekly salaries of $24.96 and $36.47 respectively. What was the mean
salary of the total group?

1st group        2nd group        Total group
N_1 = 40         N_2 = 60         N = 100
M_1 = $24.96     M_2 = $36.47     M = (  )

6.  Twenty-five  employees  of  the  Smith-Jones  Machine  Company 
earned  $764.38  in  a  week,  and  fifteen  other  employees  earned  $638.92 
during  the  same  period.  What  was  the  mean  weekly  salary  of  the  forty 
employees? 

7. If in a series of $N_1$ observations, the arithmetic mean is $M_1$, and in a
second series of $N_2$ observations, the arithmetic mean is $M_2$, show that for
the entire group of $N = N_1 + N_2$ observations:

\[ \text{Combined mean } M = \frac{N_1 M_1 + N_2 M_2}{N} \]

8. Generalize Exercise 7 above for n groups and show that:

\[ \text{Combined mean } M = \frac{N_1 M_1 + N_2 M_2 + \cdots + N_n M_n}{N} = \frac{\sum N_i M_i}{N} \]

where $N = N_1 + N_2 + \cdots + N_n$

9. Prove: $M_{aX} = a M_X$. Illustrate.

10. Prove: $M_{aX+b} = a M_X + b$. Illustrate.

11.  The  sales  record  of  a  certain  firm  showed  the  following  items: 
800  articles  at  10  cents;  400  articles  at  25  cents;  300  articles  at  50  cents. 
What  was  the  average  price  per  article? 

12. The following data taken from Bulletin 435 of the U.S. Bureau of
Labor Statistics, "Wages and Hours of Labor in the Men's Clothing
Industry, 1911-1926," give the weekly earnings in 1926 of Hand Sewers
on Men's Coats in St. Louis, Cincinnati, and Cleveland. Compute M
for each distribution.

                       Number of Employees
Weekly earnings    Cincinnati   Cleveland   St. Louis
 $0 a.u. $2             1           1           0
  2 a.u.  4             1           2           2
  4 a.u.  6             2           1           2
  6 a.u.  8             2           1           4
  8 a.u. 10             6           3          11
 10 a.u. 12            14           4          13
 12 a.u. 14            27          10          28
 14 a.u. 16            15          12          28
 16 a.u. 18            19          14          29
 18 a.u. 20            15          22          21
 20 a.u. 22             9          28          13
 22 a.u. 24             7          33           6
 24 a.u. 26             7          26           2
 26 a.u. 28             4          14           2
 28 a.u. 30             2          18           4
 30 a.u. 32             2          13           1
 32 a.u. 34             3           3           1
 34 a.u. 36             1           4           1
 36 a.u. 38             3           0           2
 38 a.u. 40             0           0           1
Total                 140         209         171

13. The data of the following table give the transverse strength of
bricks in pounds per square inch. They are taken from:
American Society of Testing Materials, Vol. 33, Part I, p. 458.
(Measurements made to nearest 10 pounds.)

Compute M.

Strength
(lbs. per sq. in.)    Number of Bricks
  230- 370                   1
  380- 520                   1
  530- 670                   6
  680- 820                  38
  830- 970                  80
  980-1120                  83
 1130-1270                  39
 1280-1420                  17
 1430-1570                   2
 1580-1720                   2
 1730-1870                   0
 1880-2020                   1
Total                      270

25. THE MEDIAN, Md

A second measure of central tendency, one that has a wide usage
in statistical work, is the median. Roughly speaking, the median of a
set of numbers is the middle one of the set when they are arranged in
order of magnitude. Thus, if the set of numbers 33, 93, 45, 83, 72,
97, 21, 67, 91, 46, 82 be arranged in the order of their size: 21, 33, 45,
46, 67, 72, 82, 83, 91, 93, 97, the middle number, 72, is called the
median number. Since there are eleven numbers in the above set,
the sixth number is the median. In general, if there are N numbers
in a set arranged in the order of their size (i.e., "arrayed"), the
median number is the one that corresponds to (N + 1)/2. If N is
even, obviously there is no middle number. In this case the median
is commonly taken to be one-half the sum of the two middle numbers.
For example, the median of the set 6, 7, 9, 12, 16, 20 is usually taken
to be (9 + 12)/2 = 10.5.
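
The rule for ungrouped numbers can be sketched in Python as follows. The function name median_of_set is an illustrative assumption; the two sets are the examples given in the text.

```python
def median_of_set(values):
    """Median of ungrouped numbers: the middle one of the arrayed set,
    or one-half the sum of the two middle ones when N is even."""
    arrayed = sorted(values)
    n = len(arrayed)
    middle = n // 2
    if n % 2 == 1:
        # Odd N: the ((N + 1)/2)th number, at zero-based index N // 2.
        return arrayed[middle]
    # Even N: average the two middle numbers.
    return (arrayed[middle - 1] + arrayed[middle]) / 2


odd_case = median_of_set([33, 93, 45, 83, 72, 97, 21, 67, 91, 46, 82])
even_case = median_of_set([6, 7, 9, 12, 16, 20])
print(odd_case, even_case)  # 72 10.5
```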

If  the  measures  are  tabulated  in  a  frequency  distribution,  we  shall 
define  the  median  as  the  point  on  the  X-scale  such  that  one-half  the 
measures  are  below  it  and  one-half  are  above  it.  On  the  histogram, 
frequency  polygon,  or  frequency  curve,  it  is  that  point  on  the  X-axis 
at  which,  if  an  ordinate  is  erected,  the  area  of  the  histogram,  polygon, 
or  curve  will  be  bisected.  The  class  interval  in  which  the  median  is 
found  is  called  the  median  class. 

In Section 18 (p. 48) we had a little work in computing the median
in simple situations. Let us now derive a formula for finding Md
by looking at the matter from a slightly different point of view.


Let:  N   = the total frequency
      w   = the class width
      b_2 = the lower class boundary of the median class
      B_2 = the upper class boundary of the median class
      n_2 = the total frequency of all classes less than b_2
      N_2 = the total frequency of all classes greater than B_2
      f_2 = the frequency of the median class
      z_2 = the distance from b_2 to the median
      Md  = the median

Since w is the width of each rectangle and the altitudes are $f(X_1)$,
$f(X_2)$, . . ., $f(X_n)$, the area of the histogram is:

\[ \text{area} = w f(X_1) + w f(X_2) + \cdots + w f(X_n) = w\left[f(X_1) + f(X_2) + \cdots + f(X_n)\right] = w \sum f(X) = wN \]


Figure 2. [Histogram referred to the axes OX and OY, showing the median class between b_2 and B_2 and the ordinate erected at Md.]


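That is, area wN represents N measures, and therefore

area $\frac{wN}{2}$ represents $\frac{N}{2}$ measures, and

area $w n_2$ represents $n_2$ measures.

From the figure we have:

\[ \text{area } ABKb_2 + \text{area } b_2 LRM_d = \frac{wN}{2} \]

or

\[ w n_2 + z_2 f_2 = \frac{wN}{2} \]

from which we obtain

\[ z_2 = \left(\frac{\frac{N}{2} - n_2}{f_2}\right) w \]

Hence the median is given by:

\[ M_d = b_2 + z_2 = b_2 + \left(\frac{\frac{N}{2} - n_2}{f_2}\right) w \tag{4} \]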


The student should note especially that the value of the median
requires the class boundary, not the class mark, of the median class.
Once the median class is determined we know immediately $n_2$,
$b_2$, and $f_2$. Then computing Md is decidedly simple.

Our first task, then, in computing Md is to determine the median
class. To do this we find N/2, begin at the lower end of the scale
and add the frequencies in the successive classes until the lower
limit, $b_2$, of the class containing the median is reached. We then
have the median class and, incidentally, $n_2$.

Next we find $N/2 - n_2$, and observe the frequency of the median
class, $f_2$. We now have all the elements required by formula (4);
hence, substituting the values, we find Md.

Consider  the  data  in  Table  18  as  an  illustrative  example. 

Table 18. Computing Md for Semester Grades of 125 Students in College Algebra

Class         f(x) ^
92.5-97.5       4
87.5-92.5       6
82.5-87.5      12
77.5-82.5      19
72.5-77.5      37 = f_2
67.5-72.5      24
62.5-67.5      11
57.5-62.5       6
52.5-57.5       4
47.5-52.5       2
Total         125 = N

(The five classes below the median class together contain $n_2$ = 47 measures.)

We have:

w = 5
N = 125
N/2 = 62.5
f_2 = 37
b_2 = 72.5
n_2 = 47

Hence by (4):

\[ M_d = 72.5 + \left(\frac{15.5}{37}\right) 5 = 74.595 = 74.6 \text{ (approx.)} \]
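
Formula (4) can be sketched as a short routine. The function name grouped_median and its argument layout (classes listed from the lowest upward, equal class widths) are assumptions made for this illustration; the data are those of Table 18.

```python
def grouped_median(lower_boundaries, freqs, w):
    """Median by formula (4): Md = b2 + ((N/2 - n2) / f2) * w.

    Classes must be given in ascending order with equal width w.
    """
    half = sum(freqs) / 2
    below = 0  # n2: total frequency of the classes below the current one
    for b2, f2 in zip(lower_boundaries, freqs):
        # The median class is the first non-empty class in which the
        # cumulative frequency reaches N/2.
        if f2 > 0 and below + f2 >= half:
            return b2 + (half - below) / f2 * w
        below += f2
    raise ValueError("distribution has no measures")


# Table 18, rearranged from the lowest class upward.
boundaries = [47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5, 82.5, 87.5, 92.5]
freqs = [2, 4, 6, 11, 24, 37, 19, 12, 6, 4]

Md = grouped_median(boundaries, freqs, w=5)
print(round(Md, 3))  # 74.595
```

When N/2 falls exactly at a boundary next to an empty class, the definition admits more than one value; this sketch simply returns the lowest such point.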


Employing  the  assumption  made  in  Section  12  to  the  effect  that 
the  items  of  a  class  are  uniformly  or  evenly  distributed  over  the 
interval,  we  can  find  the  median  by  simple  interpolation  and  thus 
be  freed  from  the  tedium  of  remembering  a  formula. 

Consider the data of Table 8 on page 26. We count from the
lower values and determine 72.5-77.5 to be the median class. Below
this class are found 2 + 4 + 6 + 11 + 24, or 47, scores. We
need to move up the scale above 72.5 a distance $z_2$ until we obtain
15.5 scores from the 37 scores of the median class, and thus have
47 + 15.5 = 62.5, or N/2, scores. By simple proportion we set up
the equation for determining $z_2$ and thus find Md. The following
diagram may assist in understanding the solution.

^ From this point forward we shall designate the class frequency corresponding
to X, x', or x by f(x).

Diagram 10

Distances:      z_2  is to  5
Frequencies:    15.5 is to  37

\[ \frac{z_2}{5} = \frac{15.5}{37}, \qquad z_2 = \frac{77.5}{37} = 2.095 \]

\[ M_d = 72.5 + z_2 = 74.595 \text{ c.u.} \]


EXERCISES 

Compute  the  medians  of  the  following  distributions: 

1.  The  distribution  of  Exercise  1,  page  41. 

2.  The  distribution  of  Table  13,  page  49. 

3.  The  distributions  (a)  and  (b)  of  Exercise  1,  page  54. 

4.  The  distributions  of  Exercise  12,  page  74. 

5.  The  distribution  of  Exercise  13,  page  75. 

6. Refer to Figure 2, and by equating to wN/2 the area VSTRMdV,
show that the median is given by:

\[ M_d = B_2 - \left(\frac{\frac{N}{2} - N_2}{f_2}\right) w \]

7. According to the definition, what may be the
median of the adjacent distribution? At what
point would you take the median?

Class          f(x)
30 a.u. 33       4
27 a.u. 30       8
24 a.u. 27      16
21 a.u. 24       0
18 a.u. 21      12
15 a.u. 18      12
12 a.u. 15       4
Total           56


8. Compute the median for the distribution of Exercise 3, page 42.
Since this is a distribution of discrete data, what interpretation can you
give to this median?

26. THE MODE, Mo

A mere glance at Table 8 (p. 26) informs us that the class interval
72.5-77.5 has the greatest frequency. It is called the modal class.
The class mark, 75, of the modal class is called the crude mode.

The  mode  may  be  roughly  defined  as  the  measure  that  occurs 
most  frequently.  The  modal  height  of  twelve-month-old  white  boys 
is  about  29  inches,  for  there  are  more  twelve-month-old  white  boys 
29  inches  high  than  for  any  other  height.  Any  haberdasher  will 
tell  you  that  there  are  more  calls  for  shirts  of  size  15  than  for  any 
other  size;  hence  the  modal  size  for  shirts  is  15.  The  mode  is  the 
typical measure, the fashionable measure, la mode. It is probably
what the layman understands as the "average."

The  true  mode  is  easy  to  define  but  very  difficult  to  determine. 
The  true  mode  is  the  value  of  X  at  which  the  ideal  frequency  curve 
which  best  fits  a  set  of  data  has  a  maximum.  Of  course  the  subject 
of  fitting  frequency  curves  in  general  is  beyond  the  scope  of  this 
text,  but  we  may  state  that  the  ideal  curve  for  a  given  distribution 
is  difficult  to  find. 

The  mode  is  roughly  approximated  by  the  mid-point  of  the  class 
with  the  greatest  frequency.  We  appropriately  call  this  value  the 
crude  mode.  We  obtain  a  closer  approximation  to  the  true  mode  by 
making  a  correction  upon  the  crude  mode.  This  correction  is  made 
by a process of interpolation. Such interpolation is usually based
upon the values that determine the modal class and its two adjacent
classes which we choose to call the three "central" classes.

While  it  is  true  that  for  most  mound-shaped  distributions  the 
mode  is  in  the  central  part  of  the  distribution,  it  is  not  unusual  to 
encounter  a  mode  near  one  of  the  extremes  of  the  distribution.  When 
this  does  occur  the  mode  is  certainly  an  important  measure  of  central 
tendency. 

Of the several methods we shall use to determine an approximate
mode, probably the method-of-the-parabola is the best. Although
the mode, like the median, does not behave beautifully in the algebra
of statistics and does not integrate conveniently in the description
of the more complex features of statistical phenomena, it deserves
a careful consideration. Let us now proceed to the problem of this
section, how to find an approximate mode.

Consider the grades in College Algebra, Table 8 (p. 26). The modal
class and the two adjacent classes are

 X     f(x)
80      19
75      37
70      24

There  is  a  well-defined  modal  class,  namely,  that  with  the  class 
mark  of  75.  Further,  since  there  are  24  members  in  the  70  class  and 
19  members  in  the  80  class,  certainly  the  mode  should  be  drawn  from 
75 toward 70 because of the added weight of the 70 class. Evidently
the mode is located in the 72.5-77.5 interval, the point to be
determined by the weights of the adjacent classes. Consider the
following diagram.

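Diagram 11. [The modal class interval from 72.5 to 77.5, with the weight 24 suspended at 72.5 and the weight 19 at 77.5; Mo lies at the distance z above 72.5, a little below 75.]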

Let the frequencies 24 and 19 be considered as weights suspended
at the ends of the modal class interval. Let z be the amount that must
be added to the lower boundary 72.5 to give the approximate mode.
In order that the weights shall balance at Mo, we must have:

\[ 24z = 19(5 - z) \]

from which we obtain

\[ z = \frac{95}{43} = 2.2 \]

and

\[ Mo = 72.5 + z = 74.7 \]

This illustrates the well-known method given by Professor W. I.
King in his Elements of the Statistical Method, page 124. In general,
let:

f_{-1} = the frequency of the class next lower than the modal class
f_1 = the frequency of the class next higher than the modal class
b = the lower boundary of the modal class
B = the upper boundary of the modal class
w = the class width
z = the amount which must be added to b to give Mo

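If the frequencies are suspended as weights at the ends of the
modal class interval, then in order for the weights to balance at Mo we
must have:

Diagram 12. [Lever diagram: the modal class interval from b to B, of width w, with the balance point Mo at distance z from b.]

$$f_{-1} z = f_1 (w - z)$$

from which

$$z = \left( \frac{f_1}{f_{-1} + f_1} \right) w$$

and

$$Mo = b + z = b + \left( \frac{f_1}{f_{-1} + f_1} \right) w \qquad (5)$$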
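Formula (5) is easily checked by machine. The following is only a minimal sketch in Python; the function name `mode_by_balance` and its argument names are our own illustrative choices, not part of King's method.

```python
def mode_by_balance(lower_boundary, width, f_below, f_above):
    """Approximate mode by formula (5).

    f_below and f_above are the frequencies of the classes next lower and
    next higher than the modal class, treated as weights hung at the two
    ends of the modal class interval.
    """
    z = f_above / (f_below + f_above) * width  # distance from lower boundary
    return lower_boundary + z


# College Algebra grades: b = 72.5, w = 5, f_-1 = 24, f_1 = 19
print(round(mode_by_balance(72.5, 5, 24, 19), 1))  # 74.7
```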


It  may  be  argued  that,  to  be  consistent  with  Section  22  (p.  60),  the 
frequencies  should  be  suspended  at  the  mid-points  of  the  respective 
class  intervals.  We  shall  give  some  exercises  at  the  end  of  this  section 
that  will  involve  that  very  point. 

A  second,  and  possibly  a  closer,  approximation  to  the  mode  can 
be  found  by  passing  a  quadratic  parabola  through  the  three  central 
points and finding the value of X for which Y, or f(x), has a maximum.

The student may recall from elementary or college algebra that

$$Y = aX^2 + bX + c$$

represents a parabola; that it has a maximum if a is negative, a
minimum if a is positive, as in Figures 3 and 4. It can be shown
in several ways that the coordinates of the bend points, m and
M, are:

$$X = -\frac{b}{2a}, \qquad Y = \frac{4ac - b^2}{4a}$$

Figures 3 and 4. [Parabolas illustrating the maximum point M (a negative) and the minimum point m (a positive).]


That is, if a is negative, the value of X for which $aX^2 + bX + c$ is a
maximum is:

$$X = -\frac{b}{2a}$$

For example, $aX^2 + bX + c$ can be put into the form:

$$a\left(X + \frac{b}{2a}\right)^2 + \frac{4ac - b^2}{4a}$$

If a is negative, the largest value is obtained when $X + \frac{b}{2a} = 0$; that
is, when $X = -\frac{b}{2a}$. Similarly, if a is positive, the smallest value is
obtained when $X = -\frac{b}{2a}$.

Let us apply this method to the distribution of grades in college
algebra, the three "central" classes for which were given on page 81.
When they are plotted they appear as in Figure 5.

Figure 5. [Plot of the three central points (70, 24), (75, 37), and (80, 19), with the maximum point M of the parabola through them.]

We  have  the  three  points  on  the  curve  as  shown.  The  equation 
of  the  curve  is: 

$$Y = AX^2 + BX + C$$
Substituting the coordinates, we have:

$$24 = A(70)^2 + B(70) + C$$
$$37 = A(75)^2 + B(75) + C$$
$$19 = A(80)^2 + B(80) + C$$

Solving for A, B, and C, we obtain

$$A = -0.62, \quad B = 92.5, \quad C = -3413$$

and the equation of the curve passing through the three given points
is:

$$Y = -0.62X^2 + 92.5X - 3413$$

The value of X for which Y is greatest is

$$X = -\frac{B}{2A} = \frac{92.5}{1.24} = 74.597$$

and  this  is  an  approximate  mode. 
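The solution of the three simultaneous equations may be verified by a short computation. The sketch below is in Python and uses divided differences rather than elimination, which is our substitution and not the method of the text; the function names are hypothetical. Exact fractions are used so that no rounding enters until the final step.

```python
from fractions import Fraction


def parabola_through(p0, p1, p2):
    """Coefficients (A, B, C) of Y = A*X^2 + B*X + C through three points."""
    (x0, y0), (x1, y1), (x2, y2) = [
        (Fraction(x), Fraction(y)) for x, y in (p0, p1, p2)
    ]
    slope01 = (y1 - y0) / (x1 - x0)  # first divided differences
    slope12 = (y2 - y1) / (x2 - x1)
    A = (slope12 - slope01) / (x2 - x0)  # second divided difference
    B = slope01 - A * (x0 + x1)
    C = y0 - A * x0**2 - B * x0
    return A, B, C


def vertex_x(A, B):
    """Abscissa of the bend point, X = -B / (2A)."""
    return -B / (2 * A)


A, B, C = parabola_through((70, 24), (75, 37), (80, 19))
print(float(A), float(B), float(C))       # -0.62 92.5 -3413.0
print(round(float(vertex_x(A, B)), 3))    # 74.597
```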

The algebra can be greatly simplified* by using the class width
(w = 5) as a unit and by moving the origin to the point (75, 0)
where h = 75, the crude mode. We then have:

$$X = 5x' + 75 \quad \text{or} \quad x' = \frac{X - 75}{5}$$

and the equation of the curve is now of the form:

$$Y = ax'^2 + bx' + c$$

Figure 6 exhibits the (x', Y) coordinates of the three "central"
points.

Figure 6. [Plot of the three central points in (x', Y) coordinates: (-1, 24), (0, 37), and (1, 19).]

* See Exercises 40 and 41, page 110, for solutions based upon determinants.
Substituting the coordinates, we have:

$$24 = a(-1)^2 + b(-1) + c$$
$$37 = a(0)^2 + b(0) + c$$
$$19 = a(1)^2 + b(1) + c$$

Solving for a, b, and c, we have:

$$a = -15.5, \quad b = -2.5, \quad c = 37$$

The value of x' at the mode is

$$x' = -\frac{b}{2a} = -\frac{2.5}{31}$$

and the value of X at the mode is:

$$X = 5x' + 75 = -\frac{12.5}{31} + 75 = 74.597 \text{ as before.}^{1}$$
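With the origin at the class mark of the modal class and the class width as unit, the three equations solve at sight: c is the modal frequency, a is half the sum of the adjacent frequencies less the modal frequency, and b is half their difference. A minimal Python sketch of this shortcut follows; the function name and the check that a is negative are our own additions.

```python
def mode_by_parabola(class_mark, width, f_below, f_modal, f_above):
    """Approximate mode from a parabola through three adjacent classes.

    Works in the shifted variable x' = (X - class_mark) / width, so the
    three points are (-1, f_below), (0, f_modal), (1, f_above).
    """
    a = (f_below + f_above - 2 * f_modal) / 2
    b = (f_above - f_below) / 2
    if a >= 0:
        raise ValueError("the middle class is not a maximum")
    x_prime = -b / (2 * a)            # bend point in shifted units
    return class_mark + width * x_prime


print(round(mode_by_parabola(75, 5, 24, 37, 19), 3))  # 74.597
```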


For mound-shaped distributions that are moderately asymmetrical
and also possess a moderate peakedness near the center, as in Figure 7,
the formula, due to Karl Pearson,

$$M_x - Mo = 3(M_x - Md)$$

has been found to be approximately true. Since the median
and the arithmetic mean are not difficult to compute, this formula
may be used to advantage in finding Mo for certain types of
distributions. Owing to the fact that the distribution of college
algebra grades is very peaked, this formula cannot be expected to
check very satisfactorily.

Figure 7. [A moderately asymmetrical, mound-shaped frequency curve.]
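Solved for the mode, Pearson's relation reads Mo = M - 3(M - Md). The Python sketch below merely carries out that rearrangement; the function name is ours, and the mean and median used in the illustration are invented figures, not taken from any table in this book.

```python
def pearson_mode(mean, median):
    """Approximate mode from Pearson's empirical relation
    mean - mode = 3 * (mean - median)."""
    return mean - 3 * (mean - median)


# Hypothetical figures for illustration only: mean 10, median 9
print(pearson_mode(10, 9))  # 7
```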

EXERCISES 

Find the approximate modes by three different methods for each of the
following distributions:

1. Of Exercise 1, page 41.

2. Of Exercise 3, page 42.

3. Of Exercise 1(a), page 54.

4. Of Exercise 2, page 54.

5. Assume that the class frequencies $f_{-1}$ and $f_1$ of the classes adjacent to
the modal class are suspended as weights at their class marks as in the figure,
and show that, if the weights balance at Mo:

$$z = \left[ \frac{3f_1 - f_{-1}}{2(f_1 + f_{-1})} \right] w$$

and

$$Mo = b + z = b + \left[ \frac{3f_1 - f_{-1}}{2(f_1 + f_{-1})} \right] w$$

Diagram 13. [Lever diagram: the weights $f_{-1}$ and $f_1$ suspended at the class marks of the classes adjacent to the modal class interval from b to B, balancing at Mo.]

1 The method of determining an approximate mode by passing a quadratic
parabola through three points gives the same result as the method of finite
differences given by Czuber, Die statistischen Forschungsmethoden, p. 71, which is
mentioned by Professor Rietz in the Handbook of Mathematical Statistics, p. 27.

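6. Assume that $f_0$, the frequency of the modal class, and $f_{-1}$ and $f_1$, the
frequencies of the classes adjacent to the modal class, are suspended as
weights at their respective class marks, as in the figure, and show that if
the weights balance at Mo:

$$z = \left[ \frac{f_0 + 3f_1 - f_{-1}}{2(f_{-1} + f_0 + f_1)} \right] w$$

and

$$Mo = b + z = b + \left[ \frac{f_0 + 3f_1 - f_{-1}}{2(f_{-1} + f_0 + f_1)} \right] w$$

Diagram 14. [Lever diagram: the weights $f_{-1}$, $f_0$, and $f_1$ suspended at their respective class marks, balancing at Mo.]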


7. Show that the value of Mo in Exercise 6 above is the arithmetic
mean of the modal group and the two groups adjacent to it; that is, show
that:

$$Mo = \frac{X_{-1} f_{-1} + X_0 f_0 + X_1 f_1}{f_{-1} + f_0 + f_1}$$

Hint: $X_{-1} = b - \frac{w}{2}$, $X_0 = b + \frac{w}{2}$, and $X_1 = b + \frac{3w}{2}$.
27. THE GEOMETRIC MEAN, Mg, AND KINDRED TOPICS

In the preceding pages of this chapter considerable attention has
been devoted to the three measures of central tendency that are most
widely used: the arithmetic mean, the median, and the mode.

Not all data are most logically averaged by these measures. From
several points of view the best average for the numbers 2, 4, and 8 is
not 4 2/3, their arithmetic mean, but $4 = \sqrt[3]{2 \cdot 4 \cdot 8}$, their geometric mean.
The most logical average for the set of numbers 2, 4, 8, 16, 32, 64, 128
is their geometric mean, the seventh root of their product, which is 16.
It is a member of the group and, when the data are plotted, is in the
curve or trend of the data. The geometric mean is represented by
the point B, and the arithmetic mean by the point C (4, 36 2/7).

Table 19

X     2^X

1     2
2     4
3     8
4     16
5     32
6     64
7     128

Figure 8. [Plot of the points of Table 19, with the point B marking the geometric mean and the point C the arithmetic mean.]
We have learned in college algebra that series in which the quantities
increase or decrease at each interval by a constant percentage of
the value at the beginning of the interval are in geometric progression.
In different words, if the ratio of any number to the preceding number
is constant, the numbers are in geometric progression.

It is to such classes of numbers that the geometric mean most
logically applies as an average. In observed data it is not expected
that the ratio of any number to the preceding number will be absolutely
constant; however, if the ratio is approximately constant, in
such data the geometric mean is preferable.

The  geometric  mean,  then,  is  widely  used  in  averaging  rates  of 
increase  or  decrease,  such  as  in  the  study  of  the  growth  of  any 
statistical  population,  growth  of  skill  in  an  individual,  relative 
changes  in  the  prices  of  commodities  —  in  short,  any  data  that 
approximately  satisfy  the  previously  stated  criterion. 

Consider  the  following  table: 

Table 20. Population of the Continental United States*

Year     Population X     Ratio of Each Item
         (millions)       to the One above

1910     92.0
1920     105.7            1.15
1930     122.8            1.16

Since  in  this  particular  period  of  twenty  years  the  ratios  are  1.15 
and  1.16,  essentially  constant,  we  assume  that  the  populations  are 
in  geometric  progression,  and  their  average  would  be  their  geometric 
mean,  namely: 

$$M_g = \sqrt[3]{(92.0)(105.7)(122.8)}$$

To evaluate this we shall use logarithms, and write:

$$\log M_g = \tfrac{1}{3}[\log 92.0 + \log 105.7 + \log 122.8]$$
$$= \tfrac{1}{3}[1.9638 + 2.0241 + 2.0892]$$
$$= \tfrac{1}{3}[6.0771] = 2.0257$$

and Mg = 106.1 millions
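The same logarithmic computation can be sketched in Python, with the machine's logarithms taking the place of the four-place table. The function name `geometric_mean` is our own choice for the illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean via the arithmetic mean of the common logarithms."""
    logs = [math.log10(x) for x in values]
    return 10 ** (sum(logs) / len(logs))


# Populations of Table 20, in millions
print(round(geometric_mean([92.0, 105.7, 122.8]), 1))  # 106.1
```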

*  The  data  are  taken  from  the  Fifteenth  Census  of  the  United  States,  Vol.  I, 
Population,  p.  6. 
Further, if we assume that the decade rate of growth will continue
for the next decade, then the population in 1940, for example, will be
1.16(122.8) = 142.4 millions.

We have noted that the decade ratio of growth from 1910 to 1920
is 1.15, or 115 per cent; that is, the population increased by 15 per
cent. Suppose we are interested in the annual rate of growth, which we
assume is constant during the decade; then we may interpolate the
population for the years 1911, 1912, etc.

Let: 

r = the annual rate of increase
$P_0$ = the population in 1910
$P_1$ = the population in 1911 = $P_0 + P_0 r = P_0(1 + r)$
$P_2$ = the population in 1912 = $P_1 + P_1 r = P_0(1 + r)^2$
. . .
$P_{10}$ = the population in 1920 = $P_0(1 + r)^{10}$

Therefore:

$$92(1 + r)^{10} = 105.7$$

To  solve  this  equation,  we  may  use  logarithms.  Hence: 

10  log  (1  +  r)  =  log  105.7  -  log  92 

=  2.0241  -  1.9638  =  0.0603 
log  (1  +  r)  =  0.00603 
(1  +  r)  =  1.014 

r  =  0.014  =  1.4  per  cent 

Hence: 

$$P_1 = P_0(1 + r) = 92(1.014) = 93.3 \text{ millions in 1911}$$
$$P_2 = P_1(1 + r) = 93.3(1.014) = 94.6 \text{ millions in 1912}$$

If we assume the same annual rate to continue from 1920 to 1921,
then the population in 1921 is given by:

$$P_{11} = P_{10}(1 + r) = 105.7(1.014) = 107.2 \text{ millions in 1921}$$
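The interpolation above may be sketched in Python as follows. The functions `annual_rate` and `project` are hypothetical names of our own; they take the tenth root directly instead of using logarithm tables, so the later figures may differ from the text in the last place.

```python
def annual_rate(p_start, p_end, periods):
    """Constant rate r per period such that p_start * (1 + r)**periods = p_end."""
    return (p_end / p_start) ** (1 / periods) - 1


def project(p_start, rate, periods):
    """Population after the given number of periods at a constant rate."""
    return p_start * (1 + rate) ** periods


r = annual_rate(92.0, 105.7, 10)   # 1910 to 1920, in millions
print(round(r, 3))                 # 0.014
print(round(project(92.0, r, 1), 1))   # 93.3, the 1911 estimate
```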

We are now ready to define the geometric mean of N measures to be
the Nth root of their product. If $X_1, X_2, \ldots, X_N$ are the measures, then:

$$M_g = \sqrt[N]{X_1 \cdot X_2 \cdots X_N} \qquad (6)$$
It is convenient to express this equation in logarithmic form, thus:

$$\log M_g = \frac{\log X_1 + \log X_2 + \cdots + \log X_N}{N} = \frac{\sum \log X}{N}$$

In  other  words,  the  logarithm  of  the  geometric  mean  is  equal  to 
the  arithmetic  mean  of  the  logarithms  of  the  original  measures. 

If the data are arranged in the form of a frequency distribution,
that is, if $X_1$ appears $f(x_1)$ times, $X_2$, $f(x_2)$ times, and so on, the
formula becomes:

$$M_g = \sqrt[N]{X_1^{f(x_1)} \cdot X_2^{f(x_2)} \cdots X_n^{f(x_n)}} \qquad (7)$$

where

$$N = f(x_1) + f(x_2) + \cdots + f(x_n)$$

Suggested Exercise: Using the frequencies in formula (7) as weights,
show that the logarithm of a weighted geometric mean is the weighted
arithmetic mean of the logarithms of the measures; that is:

$$\log M_g = \frac{\sum f(x_i) \log X_i}{N}$$
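A numerical check of this logarithmic form is sketched below in Python. It does not replace the algebraic proof asked for; the function name and the small frequency distribution are invented for the illustration.

```python
import math


def weighted_geometric_mean(values, freqs):
    """Geometric mean of a frequency distribution, formula (7),
    computed as the weighted arithmetic mean of the logarithms."""
    n = sum(freqs)
    log_mean = sum(f * math.log10(x) for x, f in zip(values, freqs)) / n
    return 10 ** log_mean


# Hypothetical distribution: the value 2 occurs twice, the value 8 once
print(round(weighted_geometric_mean([2, 8], [2, 1]), 4))
print(round((2 * 2 * 8) ** (1 / 3), 4))  # same result from the definition
```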


Example 1. The following table gives for the years indicated the
number of divorces in the United States per 1,000 marriages. Find Mg.¹

Year     No. of divorces          Log X
         per 1,000 marriages
         X

1906     84                       1.9243
1916     108                      2.0334
1922     131                      2.1173
1926     150                      2.1761
1931     170                      2.2304

Σ log X = 10.4815
log Mg = (Σ log X)/5 = 2.0963
Mg = 124.9

Example  2.  Find  the  annual  rate  of  increase  of  the  divorce  rate  for 
the  period  1906-1916.   (See  Example  1  above.) 

1  Four-place  tables  of  logarithms  and  anti-logarithms  are  found  in  the 
Appendix. 
Solution. Let $r_1$ be the annual rate of increase. Following the line of
reasoning that was used on page 89, we find

$$84(1 + r_1)^{10} = 108$$

Taking logarithms,

$$\log(1 + r_1) = \frac{\log 108 - \log 84}{10} = \frac{2.0334 - 1.9243}{10} = 0.0109$$

$$1 + r_1 = 1.025$$

$$r_1 = 0.025 = 2.5\%$$

Exercise.  Find  the  rates  of  increase  of  the  divorce  rate  for  the  periods 
1916-1922,  1922-1926,  1926-1931. 

Problem  1.  Prepare  a  skeleton  table  with  the  proper  headings  for  finding 
the  geometric  mean  of  a  frequency  distribution.   See  formula  (7). 

Problem 2. A population $P_0$ increases at a constant rate r per period
for n periods. Show that the population $P_n$ at the end of n periods is
given by

$$P_n = P_0(1 + r)^n$$

Problem 3. A population $P_0$ decreases at a constant rate r per period
for n periods. Show that the population $P_n$ at the end of n periods is
given by

$$P_n = P_0(1 - r)^n$$

Problem 4. If $M_{g,X}$ is the geometric mean of N X's, and $M_{g,Y}$ is the
geometric mean of N Y's, then the geometric mean $M_g$ of the 2N values
is given by

$$M_g^2 = M_{g,X} \cdot M_{g,Y}$$

Problem  5.  Plot  the  curve:  Y  =  100(1 .+  X)\ 
Problem  6.  Plot  the  curve:  Y  =  100(1  -  Xy. 

Problem  7.  Prove:      "t      >  VYJl, 


EXERCISES 

1. Complete column 3 for the following data, and note that the ratio is
approximately constant. Find the geometric mean for the number of
registrations.

Registration of Motor Vehicles in the United States, 1920-1929*

Year     Number X        Ratio of Each
         (thousands)     Item to One above

1920     9,232
1921     10,463
1922     12,238
1923     15,092
1924     17,594
1925     19,937
1926     22,001
1927     23,133
1928     24,493
1929     26,501

* The data are taken from Statistical Abstract of the United States, 1930,
p. 385.

2. The United States gross imports of crude rubber increased from
252,922 long tons in 1920 to 563,812 long tons in 1929. Find the annual
rate of increase during this period, assuming that the rate of growth was
constant.

3. The value of a machine decreases at a constant rate from the cost
price of $1,000 to the scrap value of $100 in ten years. Find the annual rate
of decrease, and the value of the machine at the end of one, two, three,
. . . years.

4. The number of divorces per 1,000 marriages increased from 62 per
1,000 in 1890 to 166 per 1,000 in 1928. Assuming the annual rate of
increase was constant, find its value. (See Statistical Abstract of the United
States, 1930, p. 91.)

28. THE HARMONIC MEAN, Mh

Another measure of central tendency that is of value in solving
certain special types of problems is the harmonic mean. Owing to
its unfamiliarity and to the difficulties in interpreting it, it is probably
less used than any of the other measures of central tendency, yet for
certain problems its use is very desirable.

The harmonic mean is defined as the reciprocal of the arithmetic
mean of the reciprocals of the given numbers. If the given numbers
be $X_1, X_2, \ldots, X_N$, then:

$$M_h = \frac{N}{\frac{1}{X_1} + \frac{1}{X_2} + \cdots + \frac{1}{X_N}} = \frac{N}{\sum \frac{1}{X}}$$

The harmonic mean of 2, 4, 6 is:

$$M_h = \frac{3}{\frac{1}{2} + \frac{1}{4} + \frac{1}{6}} = \frac{36}{11}$$


The harmonic mean is especially useful in averaging time rates,
in finding the average price per unit when the data give the amount
of the commodity for a given price ("so much for a unit of money"),
and in the development of index numbers. For example, suppose
a man travels two miles, the first at the rate of 10 miles an hour and
the second at the rate of 20 miles an hour; what is the average speed?
The "obvious" answer of 15 miles an hour is not correct, for the
man traveled only two miles and he consumed only (1/10 + 1/20) of an
hour, or 3/20 of an hour. In 3/20 of an hour at 15 miles an hour he
would have traveled the distance (3/20)(15) = 2 1/4 miles, not 2 miles.
If r is his average rate in miles an hour, then:

(1/10 + 1/20) r = 2

from which we obtain:

r = 13 1/3 miles per hour

and  this  is  the  harmonic  mean  of  the  rates. 
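The definition can be tried on this illustration with a few lines of Python. This is a sketch only; the function name is ours, and exact fractions are used so that 13 1/3 appears without rounding.

```python
from fractions import Fraction


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    reciprocal_sum = sum(1 / Fraction(v) for v in values)
    return len(values) / reciprocal_sum


print(harmonic_mean([10, 20]))    # 40/3, that is, 13 1/3 miles per hour
print(harmonic_mean([2, 4, 6]))   # 36/11
```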

As  a  second  illustration  suppose  a  man  on  a  journey  purchases 
gasoline  as  in  the  following  table: 

Table 21. Purchase of Gasoline

Dealer     Number of Gallons     Cost per Gallon
           for $1                (Dollars)
           X

1          8                     1/8
2          12                    1/12
3          10                    1/10

We  wish  to  find  the  average  price  per  gallon  and  the  average  number 
of  gallons  for  $1. 

In the preceding illustration we assumed that the average rate
times the total time gave the total distance. Similarly we assume
here that the average price times the number of units purchased
gives the total cost.

CASE A. Spending the same amount with each dealer. Suppose D
dollars are spent with each dealer. We then have (8D + 12D + 10D)
gallons bought at a total cost of 3D dollars. Hence the average price
per gallon is (3D) ÷ (30D) dollars, or 10 cents per gallon.

CASE B. Buying the same amount from each dealer. Suppose G
gallons are purchased from each dealer. We then have 3G gallons
purchased at a total cost of (G/8 + G/12 + G/10) dollars. Hence
the average price per gallon is (G/8 + G/12 + G/10) ÷ (3G) dollars, or
10 5/18 cents per gallon. The reciprocal of this quantity gives the
average number of gallons for $1 and is the harmonic mean of the
given X values.
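The two cases may be compared by a short computation. In the Python sketch below the function names are our own, and the amounts D and G are omitted because they cancel from each ratio.

```python
from fractions import Fraction


def price_equal_spending(gallons_per_dollar):
    """Case A: the same sum is spent with each dealer.
    Average price in dollars per gallon = N / (sum of X)."""
    return Fraction(len(gallons_per_dollar), sum(gallons_per_dollar))


def price_equal_quantity(gallons_per_dollar):
    """Case B: the same number of gallons is bought from each dealer.
    Average price in dollars per gallon = (sum of 1/X) / N."""
    total_cost = sum(Fraction(1, x) for x in gallons_per_dollar)
    return total_cost / len(gallons_per_dollar)


x = [8, 12, 10]                       # Table 21
print(price_equal_spending(x))        # 1/10 dollar, or 10 cents
print(price_equal_quantity(x))        # 37/360 dollar, or 10 5/18 cents
print(1 / price_equal_quantity(x))    # 360/37 gallons for $1
```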

Exercise.  A  manufacturer  of  rivets  purchased  copper  as  follows: 
In  1918,  4  pounds  for  $1 ;  in  1921,  8  pounds  for  $1 ;  in  1925,  6|  pounds 
for  $1;  in  1932,  20  pounds  for  $1.  Find  the  average  price  per  pound 
and  the  average  number  of  pounds  for  $1  on  two  hypotheses. 

The  observant  student  will  note  that  the  price  of  an  article  may 
be  expressed  in  two  ways, 

(a)  p  units  of  money  per  unit  of  quantity,  or 

(b)  q  units  of  quantity  per  unit  of  money. 

Thus, the price of sugar may be given as (a) 5 cents per pound or
as (b) 20 pounds for a dollar or 1/5 of a pound for a cent.
Similarly,  the  speed  of  a  moving  body  may  be  expressed  as 

(a)  d  units  of  distance  per  unit  time,  or 

(b)  t  units  of  time  per  unit  distance. 

Thus  the  speed  of  a  car  may  be  given  as  (a)  30  miles  per  hour  or  as 
(b)  2  minutes  per  mile. 

Of  course  we  are  more  familiar  with  prices  and  speeds  expressed 
in  forms  (a)  but  we  need  to  give  attention  to  forms  (b)  since  they 
do  occur.  Moreover,  the  correct  average  will  depend  upon  how  the 
data  are  stated,  as  the  previous  illustrations  confirm. 

The  following  theorems  will  clarify  some  of  the  apparent  confusion 
in  which  we  find  ourselves.  Note  that  in  the  hypotheses  and  the 
conclusions  of  these  theorems  we  assume  that  prices  and  speeds  are 
expressed  in  the  familiar  forms  (a). 
A. Average Speeds and Rates

If

s = speed, number of units distance per unit time, and
t = number of units time,

then we define:

$$\text{Average speed} = \frac{\text{total distance}}{\text{total time}} = \frac{\sum st}{\sum t} \qquad (9)$$

Theorem I. If the time on each trip is constant, that is, if t = c,
then

$$\text{Average speed} = \frac{\sum st}{\sum t} = \frac{\sum sc}{\sum c} = \frac{c \sum s}{Nc} = \frac{\sum s}{N} \qquad (10)$$

which  is  the  arithmetic  mean  of  the  several  speeds. 

Example  1.  A  man  traveled  by  auto  3  days.  He  drove  10  hours  each 
day.   He  drove 

the  first  day  10  hours  at  45  mi.  per  hr., 

the  second  day  10  hours  at  40  mi.  per  hr.,  and 

the third day 10 hours at 38 mi. per hr.

What  was  his  average  speed? 

Solution:  Here  we  have  the  case  in  which  the  time  t  of  each  trip  is 
constant  and  equal  to  10  hours.   Hence,  by  (10) 

$$\text{Average speed} = \frac{\sum s}{N} = \frac{45 + 40 + 38}{3} = 41 \text{ mi. per hr.}$$

The  student  may  verify  this  by  formula  (9). 

Theorem  II.  If  the  total  distance  covered  each  trip  is  constant, 
that  is,  if  st  =  c,  then 

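$$\text{Average speed} = \frac{\sum st}{\sum t} = \frac{\sum c}{\sum \frac{c}{s}} = \frac{Nc}{c \sum \frac{1}{s}} = \frac{N}{\sum \frac{1}{s}} \qquad (11)$$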

which  is  the  harmonic  mean  of  the  speeds. 

Example  2.  A  man  traveled  by  auto  3  days.  He  covered  480  miles 
each  day.  He  drove 

the  first  day  10  hours  at  48  mi.  per  hr., 

the  second  day  12  hours  at  40  mi.  per  hr.,  and 

the  third  day  15  hours  at  32  mi.  per  hr. 

What  was  his  average  speed? 
Solution: Here we note that the total distance covered each trip (day)
is constant and equal to 480 miles. Hence, by (11)

$$\text{Average speed} = \frac{3}{\frac{1}{48} + \frac{1}{40} + \frac{1}{32}} = \frac{3}{\frac{37}{480}} = 38\tfrac{34}{37} \text{ mi. per hr.}$$

The  student  may  verify  this  by  formula  (9). 
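The verification by the defining ratio, total distance over total time, can be sketched in Python for both examples. The function name `average_speed` is our own; exact fractions keep the result 38 34/37 free of rounding.

```python
from fractions import Fraction


def average_speed(speeds, times):
    """Total distance divided by total time."""
    total_distance = sum(Fraction(s) * Fraction(t) for s, t in zip(speeds, times))
    total_time = sum(Fraction(t) for t in times)
    return total_distance / total_time


# Example 1: equal times, so the result is the arithmetic mean of the speeds
print(average_speed([45, 40, 38], [10, 10, 10]))   # 41

# Example 2: equal distances, so the result is the harmonic mean of the speeds
print(average_speed([48, 40, 32], [10, 12, 15]))   # 1440/37, or 38 34/37
```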

B.  Average  Prices 

Let 

p  =  price  per  unit  (number  of  units  of  money  per  unit  of 
quantity) 

q  =  quantity  (number  of  units  purchased  at  price  p) 
then we define:

$$\text{Average price} = \frac{\text{total amount spent}}{\text{total quantity purchased}} = \frac{\sum pq}{\sum q} \qquad (12)$$

Theorem  III.  If  the  total  amount  spent  at  each  transaction  is 
constant,  that  is,  if  pq  =  c,  then 

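$$\text{Average price} = \frac{\sum pq}{\sum q} = \frac{\sum c}{\sum \frac{c}{p}} = \frac{Nc}{c \sum \frac{1}{p}} = \frac{N}{\sum \frac{1}{p}} \qquad (13)$$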


which  is  the  harmonic  mean  of  the  prices. 

Example  3.  Mr.  Jones  usually  spends  $120  a  year  for  coal.  He  bought 
during 

the  first  year  15  tons  at  $8  per  ton, 

the  second  year  12  tons  at  $10  per  ton,  and 

the  third  year  10  tons  at  $12  per  ton. 

What  was  the  average  price  of  the  coal? 

Solution: Here we note that the total amount spent each year is constant
and equal to $120. Hence, employing (13) we find the

$$\text{Average price} = \frac{N}{\sum \frac{1}{p}} = \frac{3}{\frac{1}{8} + \frac{1}{10} + \frac{1}{12}} = \frac{3}{\frac{37}{120}} = \$9\tfrac{27}{37} = \$9.73 \text{ per ton}$$

The  student  may  verify  this  by  formula  (12). 

Theorem IV. If the same number of units is purchased at each
transaction, that is, if q = c, then

$$\text{Average price} = \frac{\sum pq}{\sum q} = \frac{\sum pc}{\sum c} = \frac{c \sum p}{Nc} = \frac{\sum p}{N} \qquad (14)$$

which  is  the  arithmetic  mean  of  the  prices. 

Example 4. When Mr. Brown purchased gasoline, he regularly purchased
10 gallons. He purchased

at station A, 10 gallons at 14¢ per gal.,
at station B, 10 gallons at 18¢ per gal.,
at station C, 10 gallons at 15¢ per gal., and
at station D, 10 gallons at 13¢ per gal.

What was the average price per gallon?

Solution: In this case we note that the same number of units, 10 gallons,
was purchased at each station. Hence, by (14) we obtain

$$\text{Average price} = \frac{\sum p}{N} = \frac{14 + 18 + 15 + 13}{4} = \frac{60}{4} = 15\text{¢ per gallon}$$

The  student  may  verify  this  by  formula  (12). 
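As a check on Theorems III and IV, the defining ratio may be set beside the ready-made means of Python's standard `statistics` module. This is a sketch under our own naming; `average_price` is a hypothetical helper, while `statistics.harmonic_mean` and `statistics.mean` are the library's own functions.

```python
import statistics


def average_price(prices, quantities):
    """Total amount spent divided by total quantity purchased."""
    spent = sum(p * q for p, q in zip(prices, quantities))
    return spent / sum(quantities)


# Example 3: $120 spent each year, so the harmonic mean of the prices applies
coal = average_price([8, 10, 12], [15, 12, 10])
print(round(coal, 2), round(statistics.harmonic_mean([8, 10, 12]), 2))

# Example 4: 10 gallons each time, so the arithmetic mean of the prices applies
gas = average_price([14, 18, 15, 13], [10, 10, 10, 10])
print(gas, statistics.mean([14, 18, 15, 13]))
```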


EXERCISES 

1.  A  young  man  took  a  trip  by  bicycle.  He  rode  8  hours  each  day. 
He  traveled  32  miles  the  first  day,  28  miles  the  second  day,  24  miles  the 
third  day,  and  20  miles  the  fourth  day.  What  was  his  average  speed? 

2.  A  man  bought  four  kinds  of  apples  at  the  following  prices: 

5 bushels of the first kind at 40¢ per bu.,
5 bushels of the second kind at 50¢ per bu.,
5 bushels of the third kind at 75¢ per bu., and
5  bushels  of  the  fourth  kind  at  $1.00  per  bu. 


What  was  the  average  price  per  bushel? 



3.  William  Smith  purchased  gasoline  from  three  dealers.  He  purchased 

from A, 20 gallons at 17¢ per gallon,
from B, 10 gallons at 11¢ per gallon, and
from C, 15 gallons at 15¢ per gallon.

What  was  the  average  price  per  gallon? 

4. Three ships make the same round-trip in 20, 24, and 30 days
respectively. What was the average number of days for the trip?

5.  In  a  certain  factory  a  unit  of  work  is  completed  by  A  in  4  minutes, 
by  B  in  5  minutes,  by  C  in  6  minutes,  by  D  in  10  minutes,  and  by  E  in 
12  minutes.  Find  (a)  the  average  number  of  units  per  hour,  (b)  the  average 
number  of  minutes  per  unit,  and  (c)  the  total  number  of  units  they  will 
complete  in  8  hours. 

6.  A  man  travels  20  miles  at  40  miles  per  hour,  10  miles  at  30  miles 
per  hour,  and  30  miles  at  60  miles  per  hour.  What  was  his  average  speed? 

7.  Five  boys  were  given  a  page  of  problems  with  the  instruction  to 
solve  as  many  as  they  could  in  an  hour.  A  solved  12,  B  solved  10,  C 
solved  8,  D  solved  6,  and  E  solved  4.  What  was  the  average  number 
of  problems  per  hour  and  the  average  number  of  minutes  per  problem? 

8. Given two unequal observations $X_1$ and $X_2$, prove

$$M_h < M_g < M$$

9. Given three unequal observations $X_1$, $X_2$, and $X_3$, prove

$$M_h < M_g < M$$

10. a. Given N unequal observations $X_1, X_2, X_3, \ldots, X_N$, prove

$$M_h < M_g < M$$

b. How do the three means compare if all the observations are equal
in value?

This is a fairly tough problem. For references, see Burgess, Mathematics
of Statistics, page 101; Chrystal, Algebra, Part II, page 46; Hall
and Knight, Higher Algebra, page 211.

11. Prove that the product of the first N integers is less than $\left(\frac{1 + N}{2}\right)^N$.

Hint. Use Exercise 10a above and Exercise 4a, page 74.

12. Prove that the product of the first N odd integers is less than $N^N$.

Hint. Use Exercise 10a above and Exercise 4b, page 74.

29. DISCUSSION AND CRITICISM OF THE MEASURES OF CENTRAL TENDENCY

Owing to the fact that many distributions tend to "pile up" near
the center, we have chosen the term central tendency to describe this
behavior. The measures of central tendency are statistical constants
that  give  the  striking  feature  of  the  central,  the  predominant,  the 
typical  variates.  The  arithmetic  mean,  the  median,  and  the  mode 
are  the  most  widely  used.  The  arithmetic  mean  is  that  measure  the 
algebraical  sum  of  the  deviations  from  which  is  equal  to  zero.  The 
median  is  that  quantity  such  that  half  of  the  observed  measures 
exceed  it  in  value  and  half  are  exceeded  by  it.  We  shall  see  later  that 
the  median  is  the  point  from  which  the  sum  of  the  absolute  values 
of  the  deviations  of  all  the  measures  from  it  is  a  minimum.  The 
mode  is  the  value  at  which  the  ideal  frequency  curve  fitted  to  the 
given  distribution  has  a  maximum. 

All  the  measures  of  central  tendency  are  called  averages.  Since 
the averages we have thus far considered are so different in their
meanings and since we shall meet other averages in succeeding
chapters, to the statistician the term average is quite indefinite. In
the consideration of the first exercise at the end of this chapter the
student will have opportunity to observe that such terms as "the
average student," "the average-sized apple," et cetera, do not connote
the  same  to  all.  We  should  therefore  speak  definitely  as  to  which 
average  is  meant  when  the  term  is  used. 

A. The Arithmetic Mean. The arithmetic mean is probably the
best understood of all the averages. To many people it is the average.
It is easy to compute, is rigidly defined, is based on all the measures,
and is well designed for algebraical manipulation. Arithmetic means
of different series can be readily combined to determine the arithmetic
mean of the entire group. The arithmetic mean can be determined
if the total and the number of the items are known, and is useful in
case a large weight is desired for the extreme measures. The arithmetic
mean is especially admired for its stability or its reliability.
If many samples are drawn from some parent population, the arithmetic
means of the given samples will usually show less fluctuation
than the other averages. We describe this property by saying that
the arithmetic mean is a very reliable or a very stable average.

Situations  frequently  arise  where  the  emphasis  upon  the  extreme 
measurements  is  undesirable.  For  example,  the  great  wealth  of  one 
man  in  a  community  will  unduly  influence  the  arithmetic  average 
of  the  wealth  of  the  community,  and  thus  the  arithmetic  mean  will 
give  a  distorted  picture  of  the  average  wealth  of  the  community. 
A disproportionately large salary paid to one employee of a group
may  cause  the  average  salary  of  the  group  based  upon  the  arithmetic 
mean  to  give  an  unfair  impression  of  the  salaries  of  the  group  as  a 
whole.  A  hundred-dollar  bill  in  a  collection  plate  may  cause  the 
arithmetic  average  of  the  donations  to  appear  absurdly  large. 

B. The Median. The median is rigidly defined and is easy to determine.
It is based upon all the measures, each of which has equal
influence, and it is not unduly influenced by the extreme measures.
It follows that the median is useful wherever extreme items are of
little importance. It is useful in characterizing groups of a
non-mathematical character which we cannot measure and yet can
arrange according to size.

The  median  is  not  so  well  understood  as  the  arithmetic  mean  or 
the  mode,  and  it  is  not  designed  for  further  mathematical  treatment. 
It  shows  a  greater  fluctuation  from  sample  to  sample  and  hence  is 
generally  less  reliable  than  the  arithmetic  mean. 

A  further  objection  to  the  median  is  its  insensitivity.  Thus  we 
can  replace  certain  measurements  of  a  given  group  by  other  measure- 
ments without  having  any  effect  upon  the  median.  Let  us  consider 
the  series  1,  3,  5,  7,  9,  11,  13.  For  this  series  we  have: 

=  7  and  M  =  7 

I  may  replace,  for  example,  the  three  numbers  which  are  larger  than  7 
by  three  other  numbers  which  are  likewise  larger  than  7  and  this 
replacement  will  have  no  effect  upon  the  median.  Thus  the  series  1, 
3,  5,  7,  16,  20,  32  has: 

Md  =  7  and  M  =  12 
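The replacement described above can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative and not part of the text.

```python
from statistics import mean, median

original = [1, 3, 5, 7, 9, 11, 13]
replaced = [1, 3, 5, 7, 16, 20, 32]  # the three values above 7 are swapped out

for label, series in (("original", original), ("replaced", replaced)):
    # The median stays put while the arithmetic mean moves.
    print(label, "Md =", median(series), "M =", mean(series))
```

Any other replacement that keeps three values above 7 would leave the median unchanged in the same way.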

The  student  will  discover  other  replacements  that  will  in  no  way 
affect  the  median  but  may  have  tremendous  effect  upon  the  arithmetic 
mean.  Exercise  19  at  the  end  of  this  chapter  is  an  illustration  of  the 
fact  that  shifting  the  positions  of  certain  measurements  of  a  group 
may  have  no  effect  whatever  upon  the  median  provided  the  median 
point  is  not  crossed. 

C. The Mode. Though the technical term may not be well
known, the concept of the mode is well understood and easily
comprehended. It is probably what the editors of our newspapers have
in mind when they speak of "the average citizen." Like the median,
it is not greatly influenced by the extreme variates. Though the
true mode is difficult to determine, the mode is so important
that even an approximate mode is often satisfactory. An approximate
mode is not difficult to determine, and it owes its importance
to the fact that it is located in the region where the frequency is most
dense. It shows the most frequent measure. For a clothing merchant,
the mode of a distribution of chest measurements is the important
average.

It  frequently  happens  that  a  distribution  has  no  well-defined  mode, 
or  there  may  be  several  apparent  modes.  The  mode  therefore  has  no 
meaning  unless  there  is  a  decided  central  tendency.  The  mode  is  also 
insensitive. 

D.  The  Geometric  Mean.  The  geometric  mean  is  based  on  all 
the  measures,  is  rigidly  defined,  is  suited  to  algebraic  manipulation, 
is  not  unduly  influenced  by  extreme  measures,  and  gives  equal  weight 
to  equal  rates  of  change.  It  may  be  appropriately  used  when  emphasis 
is  on  the  ratio  between  two  quantities  rather  than  on  their  absolute 
difference. 

The objections to the geometric mean are that it is not well
understood, is difficult to compute, and is difficult for the
non-mathematical student to comprehend.

It  is  evident  that  no  one  measure  of  central  tendency  can  be  con- 
sidered as  the  best.  Each  measure  is  useful  in  shedding  some  light 
upon  a  given  problem,  and  the  best  selection  can  be  made  only  by 
the  experienced  statistician  for  the  particular  purpose  he  has  in 
mind.  The  values  of  the  averages  considered  depend  entirely  upon 
the  discrimination  with  which  they  are  used  and  interpreted.  The 
arithmetic  mean  is  perhaps  the  most  useful.  The  ease  of  its  computa- 
tion, its  wide  uses  in  later  applications,  and  its  familiarity  to  the 
general  reader  make  it  highly  serviceable  in  statistical  work. 

EXERCISES 

1. What average is meant in each of the following: the average student?
the  average  citizen?  the  average  amount  of  material  in  a  dress  pattern?  the 
average-sized  apple?  the  average  annual  rainfall?  the  average  price  of  wheat? 
the  average  ability  in  arithmetic?  the  average  height?  the  average  length  of  life? 
the average speed of a train between two stops? the average salary of teachers
in  a  state?  the  average  number  of  bushels  of  corn  per  acre  in  a  nation? 

2. A college student carries 15 hours of class work per week and makes
the grades listed in the following table. What is his average grade?

Grades of Student Carrying 15 Hours of Class Work

Course         Semester Hours    Grade
English               2            88
Mathematics           5            96
Language              5            80
Science               3            78
Total                15
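
A weighted arithmetic mean of this kind can be sketched as follows. The helper name `weighted_mean` is hypothetical; the semester hours serve as the weights, which is an assumption about how the exercise is meant to be read.

```python
def weighted_mean(values, weights):
    """Arithmetic mean of values, each counted in proportion to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

hours = [2, 5, 5, 3]        # semester hours (the weights)
grades = [88, 96, 80, 78]   # grade in each course

average_grade = weighted_mean(grades, hours)
print(average_grade)
```

The same helper applies to Exercise 3 if the sums invested are taken as weights and the interest rates as values.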

3.  A  man  has  $10,000  invested  at  5  per  cent,  $5,000  at  6  per  cent,  and 
$3,000  at  8  per  cent.   What  is  his  average  rate  of  interest? 

4.  The  following  are  the  distributions  of  the  scores  of  334  Freshmen  on 
an  achievement  test  in  English  given  at  Bucknell  University  in  September, 
1929.  In  (a)  the  class  width  is  15,  whereas  in  (b)  the  class  width  is  10. 
What  are  the  class  boundaries?  Compute  M  for  (a)  and  (b). 


Scores of 334 Students in English

       (a)                  (b)
   X      f(X)          X       f(X)
   47       3          45.5       1
   62       7          55.5       3
   77      15          65.5       6
   92      20          75.5      12
  107      22          85.5       9
  122      37          95.5      14
  137      45         105.5      13
  152      41         115.5      22
  167      55         125.5      24
  182      27         135.5      28
  197      30         145.5      24
  212      11         155.5      34
  227      15         165.5      42
  242       6         175.5      25
 Total    334         185.5      15
                      195.5      20
                      205.5      18
                      215.5       3
                      225.5      12
                      235.5       5
                      245.5       4
                      Total     334

5. Compute the medians for the distributions of Exercise 4.

6.  Compute  the  approximate  modes  by  formula  (5)  for  the  distributions 
(a)  and  (b)  of  Exercise  4. 

7. In the following table deaths from collisions of automobiles with
railroad trains and street cars are not included.

Automobile Fatalities in the Entire Registration
Area in Continental United States, 1911-1930¹

Year    Number of Deaths        Year    Number of Deaths
1911          1,291             1921         10,168
1912          1,758             1922         11,666
1913          2,488             1923         14,411
1914          2,826             1924         15,528
1915          3,978             1925         17,571
1916          5,193             1926         18,871
1917          6,724             1927         21,160
1918          7,525             1928         23,765
1919          7,968             1929         27,066
1920          9,103             1930         29,080


Make  a  chart  representing  these  data.  Find  the  geometric  mean  of  the 
annual  rates  of  increase.  Also  find  the  geometric  mean  of  the  number  of 
deaths. 
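
A minimal sketch of the two geometric means asked for here, assuming that "annual rate of increase" means the ratio of each year's deaths to those of the year before. The helper `geometric_mean` is written out with logarithms, in the spirit of the text's method; the names are illustrative.

```python
import math

# Deaths for the twenty years of the table, in order.
deaths = [1291, 1758, 2488, 2826, 3978, 5193, 6724, 7525, 7968, 9103,
          10168, 11666, 14411, 15528, 17571, 18871, 21160, 23765, 27066, 29080]

def geometric_mean(values):
    """Antilogarithm of the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Ratio of each year to the year before it (19 ratios for 20 years).
ratios = [later / earlier for earlier, later in zip(deaths, deaths[1:])]

gm_ratio = geometric_mean(ratios)
# The product of the ratios telescopes, so only the end values matter.
shortcut = (deaths[-1] / deaths[0]) ** (1 / len(ratios))
gm_deaths = geometric_mean(deaths)

print(gm_ratio, shortcut, gm_deaths)
```

The agreement of `gm_ratio` and `shortcut` shows why the geometric mean is the natural average for rates of change.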

8.  The  following  table  gives  the  deaths  from  tuberculosis  by  ages.  Note 
that  the  class  intervals  are  not  all  equal.  Using  the  result  of  the  theorem 
in  Exercise  7  on  page  74,  find  the  mean  age  of  death  from  this  cause. 


Deaths from Tuberculosis by Ages²

Age of Death   Number Dying f(X)      Age of Death   Number Dying f(X)
    0- 4             1,356               30-34             8,776
    5- 9               537               35-44            15,456
   10-14             1,278               45-54            11,060
   15-19             6,300               55-64             7,455
   20-24            10,911               65-74             4,788
   25-29            10,349               75-84             1,866

1 The data are taken from Statistical Abstract of the United States, 1936, p. 367.

2 The data are taken from Mortality Statistics, 1928, p. 160. The original
data have been altered somewhat, e.g., the Bureau's final classification was "75
and over."

9. In Exercise 8 we combined the number of deaths from 0 to 4
inclusive into a single group. We felt justified in doing this because in so
doing we did not conceal any outstanding facts. However, in the
accompanying table such a procedure would do violence to some outstanding
facts.


Deaths from Diphtheria by Ages

Age of Death   Number Dying f(X)      Age of Death   Number Dying f(X)
  Under 1              602               20-24                89
    1- 2             1,133               25-29                56
    2- 3             1,183               30-34                67
    3- 4             1,112               35-44               110
    4- 5               913               45-54                60
    5- 9             2,290               55-64                52
   10-14               435               65-74                22
   15-19               118               75-84                12

Using  the  result  of  the  theorem  in  Exercise  8  on  page  74,  find  the  mean 
age  of  death  from  this  cause.  Also  find  the  median  age  of  death. 

10.  The  total  number  of  divorces  granted  in  the  continental  United 
States  increased  from  33,461  in  1890  to  195,939  in  1928.  Assuming  the 
annual  rate  of  increase  was  constant,  find  its  value.  From  this  result, 
estimate  the  number  of  divorces  granted  in  the  years  1895,  1900,  1905,  and 
1925.  (See  Statistical  Abstract  of  the  United  States,  1930,  p.  91.)  The  actual 
numbers  given  are:  1895,  40,387;  1900,  55,751;  1905,  67,976;  1925, 
170,952. 

11.  For  $1  a  person  purchased  each  of  the  following  amounts  of  the 
given  articles: 

butter,  3  pounds  potatoes,  40  pounds 

sugar,  20  pounds  coffee,  4  pounds 

Find  the  average  number  of  pounds  for  $1  and  the  average  price  per  pound. 

12.  Prove  that  the  product  of  the  ratios  of  each  of  N  measures  to  their 
geometric  mean  is  equal  to  unity. 

13.  Prove  that  the  geometric  mean  of  the  ratios  of  corresponding 
measures  in  two  series  of  N  measures  each  is  equal  to  the  ratio  of  their 
geometric  means. 

14.  The  following  table  gives  the  number  of  pounds  of  sugar  that  could 
be  bought  for  $1  in  the  given  years: 

1 The data are taken from Mortality Statistics, 1928, p. 150.


Pounds of Sugar for $1, 1918-1922

Year    Pounds of Sugar Bought for $1
1918             10.3
1919              8.8
1920              5.2
1921             12.5
1922             13.7

What  is  the  average  price  per  pound  during  this  period?  Get  two  answers. 

15.  The  following  distributions  are  of  eggs  from  Barred  Plymouth  Rock 
pullets.  The  measurements  of  length  were  recorded  to  the  nearest  milli- 
meter and  those  of  breadth  to  the  nearest  half  a  millimeter.  Are  these 
tables  consistent  with  our  theory?   What  are  the  class  boundaries? 


Lengths and Breadths of Random Sample of 450
Eggs from 450 Pullets¹

          (a)                              (b)
Length in Millimeters             Breadth in Millimeters
     X        f(X)                     X         f(X)
    49.5        1                    38.25         2
    50.5        1                    38.75         4
    51.5        7                    39.25         9
    52.5       22                    39.75        18
    53.5       36                    40.25        41
    54.5       71                    40.75        52
    55.5       68                    41.25        41
    56.5       77                    41.75        65
    57.5       78                    42.25        73
    58.5       35                    42.75        48
    59.5       29                    43.25        41
    60.5       10                    43.75        26
    61.5        4                    44.25        15
    62.5        3                    44.75         7
    63.5        6                    45.25         5
    64.5        1                    45.75         2
    65.5        0                    46.25         1
    66.5        0
    67.5        1
   Total      450                    Total       450

Compute  M  for  the  distributions  (a)  and  (b). 


1  The  data  are  taken  from  Raymond  Pearl  and  F.  M.  Surface,  A  Biometrical 
Study  of  Egg  Production  in  the  Domestic  Fowl,  Part  III,  p.  183. 

16. Find the medians for (a) and (b) in Exercise 15.

17.  The  number  of  births  during  a  year  is  1/48  of  the  population  at  the 
beginning  of  the  year  and  the  number  of  deaths  during  a  year  is  1/60  of 
the  population  at  the  beginning  of  the  year.  Find  the  number  of  years 
for  the  given  population  to  be  doubled. 

18.  Compute  column  3  for  these  data,  and  note  that  the  ratio  is  approxi- 
mately constant.  Find  the  geometric  mean  of  the  expenditures. 


Expenditure for Public Schools in the United States, 1909-1919

Year         Expenditure      Ratio of Each        Log X
             (millions) X     Item to One above
1909-1910       401.4
1910-1911       426.3
1911-1912       446.7
1912-1913       482.9
1913-1914       521.5
1914-1915       555.1
1915-1916       605.5
1916-1917       640.7
1917-1918       702.2
1918-1919       763.7


19.  Here  are  data  for  two  groups  of  laborers.  Find  the  median  wage  for 
each  group.  Find  the  arithmetic  mean  wage  of  each  group.  Which  is 
the "better-paid" group?

Wages of Two Groups of Laborers

Wages per Week            Frequencies
   (dollars)          Group A     Group B
  9.00-9.49               2           2
  8.50-8.99               2           2
  8.00-8.49              10          10
  7.50-7.99              39          39
  7.00-7.49              20           1
  6.50-6.99              16          16
  6.00-6.49               6           6
  5.50-5.99               4           4
  5.00-5.49               1          20
  Total                 100         100


20.  The  population  of  New  York  State  increased  at  a  constant  annual 
rate  from  9,114,000  in  1910  to  10,385,000  in  1920.  What  was  the  annual 
rate  of  increase?  Assuming  the  same  annual  rate  to  continue  during  the 
period  1920  to  1925,  estimate  the  population  of  New  York  State  in  1925. 
Compare  your  estimate  with  the  count  of  the  State  Census  which  was 
11,161,000. 

21.  The  number  of  bacteria  in  a  certain  culture  was  found  to  be  4(10^) 
at  noon  of  one  day.  At  noon  the  next  day,  the  number  was  found  to  be 
9(10^).  If  the  number  increased  at  a  constant  hourly  rate,  how  many 
bacteria  were  there  at  midnight? 

22. The price of an automobile decreased in value at a constant annual
rate from $1,000 to $300 in five years. What was the annual rate of
decrease? What was the value of the car at the end of three years?


23. The accompanying table gives the production of motor vehicles in
the United States for the years 1908 to 1917 inclusive.
Find M and Mg of the production.

Year    Production (Thousands) X
1908              65
1909             131
1910             187
1911             210
1912             378
1913             485
1914             569
1915             970
1916            1618
1917            1874

24. The production of Portland Cement in the United States increased
from 99 million barrels in 1921 to 176 million barrels in 1928. Assuming
that the production increased at a constant annual rate, find the average
annual rate of increase.

25. The population of Detroit increased at a constant annual rate from
465,700 in 1910 to 993,700 in 1920. What was the average annual rate
of increase? Assuming the same annual rate to continue during the period
1920 to 1930, estimate the population of Detroit in 1930. Compare your
estimate with the actual census report which gave 1,568,700.

26. If in 1930 the city of Detroit built a water system sufficient to supply
a population of 2,500,000, how many years may elapse before the city
finds it necessary to enlarge its water system? Base your estimate upon
the three census reports. [See Exercise 25.]

Hint: If a is the annual rate of increase the first decade, and b is the
annual rate of increase the second decade, the average annual rate is

$$x = \sqrt{(1 + a)(1 + b)} - 1.$$
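
The hint to Exercise 26 can be checked numerically. The sketch below assumes the three census figures quoted in Exercise 25 and a constant annual rate within each decade; the variable names are illustrative.

```python
import math

pop_1910, pop_1920, pop_1930 = 465_700, 993_700, 1_568_700

# Constant annual rate within each decade.
a = (pop_1920 / pop_1910) ** (1 / 10) - 1
b = (pop_1930 / pop_1920) ** (1 / 10) - 1

# Average annual rate over both decades, as given in the hint.
x = math.sqrt((1 + a) * (1 + b)) - 1

# The same rate obtained directly from the end values over twenty years.
direct = (pop_1930 / pop_1910) ** (1 / 20) - 1

print(a, b, x, direct)
```

The hint amounts to taking the geometric mean of the two annual growth factors 1 + a and 1 + b.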

27. During five successive years a certain investment earned 5 per cent,
6  per  cent,  6.5  per  cent,  4  per  cent,  and  3.5  per  cent.  What  was  the 
average  annual  rate  of  increase? 

28.  A  does  a  unit  of  work  in  20  minutes,  and  B  does  a  unit  of  work  in 
24  minutes.  What  is  their  average  rate  of  working? 

29.  The  sales  record  of  a  certain  firm  showed  the  following  items: 
800  articles  at  10  cents;  400  articles  at  25  cents;  300  articles  at  50  cents. 
What  was  the  average  price  per  article? 

30. A man travels two miles, the first at a miles an hour and the second
at b miles an hour. Show that his average rate is

$$\frac{2ab}{a + b}$$

miles an hour. What type of average is this?
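
As a numerical check on Exercise 30, the sketch below compares total distance over total time with the harmonic mean from Python's `statistics` module. The speeds 30 and 60 are hypothetical values chosen for illustration, and `average_speed` is an illustrative name.

```python
from statistics import harmonic_mean

def average_speed(a, b):
    """Average speed over two one-mile stretches covered at speeds a and b."""
    total_time = 1 / a + 1 / b   # hours needed for the two miles
    return 2 / total_time        # total distance divided by total time

a, b = 30, 60  # hypothetical speeds in miles an hour
speed = average_speed(a, b)
print(speed, harmonic_mean([a, b]))
```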

31. The annual wages earned by a group of 423 chief wage earners in
families are given in the following table. (Houghteling, Leila: The Income
and Standard of Living of Unskilled Laborers in Chicago, p. 27.) Compute
M, Md, and Mo (by fitting a parabola) for this distribution.


  Class X        f(X)
$ 800- 899         6
  900- 999        11
 1000-1099        40
 1100-1199        50
 1200-1299        63
 1300-1399        63
 1400-1499        81
 1500-1599        45
 1600-1699        24
 1700-1799        20
 1800-1899         6
 1900-1999         7
 2000-2099         2
 2100-2199         4
 2200-2299         0
 2300-2399         1
 Total           423

32.  Milk  is  standardized  according  to  its  butterfat  content.  For  ex- 
ample, ordinary  legal  milk  is  a  3  per  cent  milk,  that  is,  3  per  cent  of  its 
weight  is  butterfat.  If  a  farmer  mixes  8  gallons  of  3  per  cent  milk,  10  gal- 
lons of  2.9  per  cent  milk,  5  gallons  of  3.5  per  cent  milk,  4  gallons  of  5.3  per 
cent  milk,  what  per  cent  butterfat  is  the  mixture?  What  type  of  average 
is  this? 

33. Compute Mg for the data of Exercise 12, page 57.

34. Which measure of central tendency would you use to summarize
the frequency distribution of the following cases, and why?

(1) Income of parents of Bucknell students.

(2) Amount spent for food by Bucknell students.

(3) Number of hours per week spent in outside preparation by Bucknell
students.

(4) Height of Bucknell men.

(5) Weight of Bucknell women.

35.  Exercise  12,  page  74,  gives  three  distributions  of  weekly  wages  in 
the  clothing  industry.   Find  Mo  for  each  distribution. 

36.  The  annual  salaries  received  by  a  group  of  Senior  federal  employees 
are  given  in  the  following  table.  (White:  Public  Administration,  page  290.) 
Note  that  the  class  intervals  are  not  all  equal. 


      Class X                f(X)
$ 720 and under $ 840           2
  840 and under   900           5
  900 and under  1000          18
 1000 and under  1100         123
 1100 and under  1200         369
 1200 and under  1320        1208
 1320 and under  1440         437
 1440 and under  1560          63
 1560 and under  1800          74
 1800 and under  2000          30
 2000 and under  2500           5
 Total                       2334

Compute M, Md, and Mo (by fitting a parabola) for this distribution.
Which average is the most appropriate?


37.  A  hardware  company  makes  7%  on  the  invested  capital  the  first 
year.  The  profit  is  added  to  the  original  capital,  and  9%  is  made  on  the 
total  investment  the  second  year.  Proceeding  in  this  way,  the  profits 
are  10%  the  third  year,  12%  the  fourth  year,  and  15%  the  fifth  year. 
What  is  the  average  rate  during  the  five-year  period? 

38.  The  value  in  millions  of  dollars  of  exports  from  the  U.S.  in  the  given 
years  are  shown  in  the  following  table.  Compute  the  geometric  mean  of 
the  values  of  the  exports. 


Year    Value of Exports      Year    Value of Exports
1885         742.2            1905        1518.6
1890         857.8            1910        1745.0
1895         808.5            1915        2768.6
1900        1394.5

39. A man wishes to travel two miles, the first at 30 miles an hour and
the second at such a speed that his average speed over the two-mile course
will be 60 miles an hour. At what speed must he travel the second mile?

40. (For students who are familiar with determinants.)¹ Let (X1, Y1),
(X2, Y2) and (X3, Y3) be the coordinates of three points on the parabola
$Y = AX^2 + BX + C$. Show that the quotient $-\frac{B}{2A}$ is given by

$$-\frac{B}{2A} = \frac{\begin{vmatrix} X_1^2 & Y_1 & 1 \\ X_2^2 & Y_2 & 1 \\ X_3^2 & Y_3 & 1 \end{vmatrix}}{2\begin{vmatrix} X_1 & Y_1 & 1 \\ X_2 & Y_2 & 1 \\ X_3 & Y_3 & 1 \end{vmatrix}}$$

Thus, if X2 is the crude mode and X1 and X3 the class marks of the
adjacent classes with X1 < X2 < X3, the above quotient gives the value
of X of the approximate mode.
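
The determinant quotient of Exercise 40 can be tried on a parabola whose vertex is known in advance. This is a sketch, not a proof: the helper names `det3` and `parabola_vertex_x` are hypothetical, and the test parabola Y = -(X - 3)^2 + 10 is chosen only for illustration.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix, expanded along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def parabola_vertex_x(points):
    """X-coordinate -B/(2A) of the vertex of the parabola through three points."""
    (x1, y1), (x2, y2), (x3, y3) = points
    numerator = det3([[x1 ** 2, y1, 1], [x2 ** 2, y2, 1], [x3 ** 2, y3, 1]])
    denominator = 2 * det3([[x1, y1, 1], [x2, y2, 1], [x3, y3, 1]])
    return numerator / denominator

# Three points on Y = -(X - 3)^2 + 10, whose vertex lies at X = 3.
points = [(2, 9), (3, 10), (5, 6)]
vertex = parabola_vertex_x(points)
print(vertex)
```

With class marks as the X values and class frequencies as the Y values, the same function returns the approximate mode described in the exercise.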

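41. If the class interval is taken as a unit and $x_i' = \frac{X_i - X_2}{w}$,
i = 1, 2, 3, show that we may obtain from Exercise 40 above the value
of x' of the approximate mode to be

$$x' = \begin{vmatrix} 1 & Y_1 & 1 \\ 0 & Y_2 & 1 \\ 1 & Y_3 & 1 \end{vmatrix} \div 2\begin{vmatrix} -1 & Y_1 & 1 \\ 0 & Y_2 & 1 \\ 1 & Y_3 & 1 \end{vmatrix}$$

when $x_2' = 0$ is taken at the crude mode.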

42. Supposing the frequencies of the X values are the terms of the
expansion $(q + p)^n$ as indicated in the table, find M if (q + p) = 1.

  X       f(X)
  0       $q^n$
  1       $n q^{n-1} p$
  2       $\frac{n(n - 1)}{2} q^{n-2} p^2$
  .       .
  .       .
  n       $p^n$

43. Find the arithmetic, the geometric, and the harmonic means of
the numbers: 1, 2, 4, 8, . . ., $2^n$.

44. Show that the median of the numbers 1, 2, 3, . . ., n is $\frac{n + 1}{2}$.



1  I  am  indebted  to  Dr.  C.  W.  Bruce  for  suggesting  this  exercise. 

30. THE INADEQUACY OF MEASURES OF CENTRAL TENDENCY

The  preceding  chapters  have  called  attention  to  the  necessity  of 
inventing  summary  numbers  to  characterize  masses  of  numerical 
data.  Chapter  3  has  dealt  with  certain  terse  expressions,  single 
magnitudes,  by  means  of  which  we  may  obtain  an  understanding  of 
the  typical  characteristics  of  the  group  as  a  whole.  They  represent 
the  acme  of  condensation.  The  arithmetic  mean,  for  example,  repre- 
sents an  average  size  of  the  measures,  and  is  the  value  such  that  the 
algebraic  sum  of  the  deviations  of  the  measures  from  it  is  zero.  All 
must  admit  the  value  of  the  measures  of  central  tendency,  but  we 
must come to realize their insufficiency. Two groups of measures
may have the same mean¹ and yet differ widely. Consider the two
groups below:

Group I            Group II
  42                 10
  45                 22
  50 (the mean)      50 (the mean)
  55                 78
  58                 90
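
The point of the two groups can be confirmed in a few lines. The sketch uses the difference between the largest and smallest value as a crude indicator of scatter; the variable names are illustrative.

```python
from statistics import mean

group_1 = [42, 45, 50, 55, 58]
group_2 = [10, 22, 50, 78, 90]

for name, group in (("Group I", group_1), ("Group II", group_2)):
    spread = max(group) - min(group)  # largest value minus smallest value
    print(name, "mean =", mean(group), "spread =", spread)
```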

The  numbers  in  Group  I  are  concentrated  about  their  mean, 
whereas  those  of  Group  II  are  widely  scattered.  Similarly,  we  may 
have  two  groups  of  laborers  with  the  same  mean  salary  and  yet  their 
distributions  may  differ  widely.  The  mean  salary  may  not  be  so 
important  a  characteristic  as  the  variation  of  the  items  from  the 
mean.  To  the  student  of  social  affairs,  the  mean  income  is  not  so 
vitally  important  as  to  know  how  this  income  is  distributed.  Are  a 
large  number  receiving  the  mean  income  or  are  there  a  few  with 
enormous  incomes  and  millions  with  incomes  far  below  the  mean? 

1 In what follows, when the term mean is used without a qualifying adjective,
the arithmetic mean is meant.

Figures 9, 10, and 11 represent frequency distributions with some
of the characteristics we wish to emphasize here. The two curves in
(a) represent two distributions with the same mean, M, but with
different dispersions. The two curves in (b) represent two
distributions with the same dispersion but with unequal means, M1 and M2.
Finally, (c) represents two distributions with unequal means and
unequal dispersions.

The measures of central tendency are therefore insufficient. They
must be supported by and supplemented with other measures. In
this chapter, we shall be especially concerned with the measures of
variability, or spread, or dispersion. A measure of dispersion is
designed to state the extent to which the individual measures differ
on the average from the mean.¹ In measuring dispersion we shall be
interested in the amount of the variation or its degree but not in the
direction.² For example, a measure of 4 inches below the mean has
just as much dispersion as a measure of 4 inches above the mean.
The amount of variability, or absolute variability, will be expressed
in concrete units, the same units that are used for the original variates,
while the degree, or relative variability, will be expressed in abstract
numbers or ratios. A measure of absolute variation is useful in
describing a single frequency distribution, but if two different
distributions are to be compared, difficulties are encountered.

The  real  significance  of  the  statements  of  the  paragraph  above  will 
be  comprehended  as  we  proceed  further  into  the  chapter.  The 
computation  of  the  measures  of  variation  for  several  distributions 
will  convince  us  that  a  measure  of  absolute  variation  is  significant 
only  in  proportion  to  the  size  of  the  thing  varying.  Therefore,  for 
the  comparison  of  the  variation  in  two  distributions,  we  shall  find 
it  necessary  to  define  certain  measures  of  relative  variability. 

There are several measures of absolute variability to which we shall
give attention. They are (1) the range, (2) the semi-interquartile
range, (3) the mean deviation, (4) the standard deviation, and
(5) the probable error.³ As to measures of relative variability, we
shall call attention to several, but we shall express our preference
for the coefficient of variation, an invention of Professor Karl Pearson.

1 Generally from the mean; infrequently from other measures of central
tendency.

2 The question of the direction of the variation will be answered in Chapter 5
in connection with the skewness.

3 A more extensive treatment of probable error will be found in Chapter 12.
Also, see Section 35.

EXERCISES


1. The heights of 11 men were 61, 64, 68, 69, 67, 68, 66, 70, 65, 67, and
72 inches. If the shortest man is omitted, what is the percentage change
in the range?

2.  The  weights  of  11  forty-year-old  men  were  148,  154,  158,  160,  161, 
162,  166,  170,  182,  195,  and  236  pounds.  If  the  heaviest  man  is  omitted, 
what  is  the  percentage  change  in  the  range? 

3.  The  range  of  the  heights  of  the  11  men  considered  in  Exercise  1  is 
72  —  61  =  11  inches  and  the  range  of  the  weights  of  the  men  considered 
in  Exercise  2  is  236  —  148  =  88  pounds.  Can  you  determine  from  this 
information  which  shows  the  greater  variation,  the  11  measurements  of 
height  or  the  11  measurements  of  weight? 

4.  Find  the  ratio  of  the  range  to  the  mean  in  Exercise  1  and  Exercise  2. 
If  these  ratios  are  used  to  measure  the  relative  variations,  can  you  answer 
the  question  proposed  in  Exercise  3? 
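
The ratio suggested in Exercise 4 can be computed directly from the data of Exercises 1 and 2. This is a sketch of the arithmetic only; `range_to_mean` is an illustrative name, and the interpretation of the result is left to the exercise.

```python
heights = [61, 64, 68, 69, 67, 68, 66, 70, 65, 67, 72]             # inches
weights = [148, 154, 158, 160, 161, 162, 166, 170, 182, 195, 236]  # pounds

def range_to_mean(values):
    """Range divided by the arithmetic mean: a pure number, free of units."""
    value_range = max(values) - min(values)
    return value_range / (sum(values) / len(values))

height_ratio = range_to_mean(heights)
weight_ratio = range_to_mean(weights)
print(height_ratio, weight_ratio)
```

Because both ratios are pure numbers, they can be compared even though one set is in inches and the other in pounds.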

5. A sample of 1515 college men was measured as to height. Their
mean  height  was  found  to  be  67.9  inches.  What  would  you  consider  a 
reasonable  variation  on  either  side  of  the  mean  for  such  a  set  of  data? 

6.  A  sample  of  1515  college  men  was  measured  as  to  weight.  Their 
mean  weight  was  found  to  be  138.9  pounds.  What  would  you  consider  a 
reasonable  variation  on  either  side  of  the  mean  for  such  a  set  of  data? 

7. Construct frequency polygons on the same sheet for distributions
A and B. Compare their arithmetic means, their medians, and their
modes. Do the measures of central tendency constitute a sufficient
description of these groups?

   X        A         B
           f(X)      f(X)
   2.5       1         0
   7.5       2         0
  12.5       3         0
  17.5       5         1
  22.5       7         3
  27.5       8        14
  32.5       9        17
  37.5       9        17
  42.5       8        14
  47.5       7         3
  52.5       5         1
  57.5       3         0
  62.5       2         0
  67.5       1         0
  Total     70        70


31.   THE  RANGE 

The simplest possible measure of the variation of a group of
measures is the range, that is, the difference between the highest recorded
score and the lowest recorded score. Since the range is determined
by only the two extreme measures, it tells us nothing of the
distribution between these extremes; it tells us nothing about the
concentration of the measures about the center.

Consider  the  distribution  of  heights  in  Exercise  1  (p.  54)  and  note 
that  the  one  man  in  the  tallest  class  increases  the  range  about  10  per 
cent.  Such  an  erratic  measure  is  of  little  use  for  purposes  of  com- 
parison. We  need  a  more  stable  measure. 


32.  THE  QUARTILE  DEVIATION 

A measure of variation superior to the range is the quartile range
or half of it, the semi-interquartile range, sometimes called the quartile
deviation. The quartiles are the points on the X-scale that divide the
distribution into four equal parts. Obviously, there are three
quartiles, the second coinciding with the median. More precisely stated,
the lower quartile, Q1, is that point on the X-scale such that
one-fourth of the total frequency is less than Q1 and three-fourths are
greater than Q1. The upper quartile, Q3, is that point on the X-scale
such that three-fourths of the total frequency are below Q3 and
one-fourth is above it. Between Q1 and Q3, then, are included one-half
the total frequency. Since, under most circumstances, the central
half of a distribution tends to be fairly typical, the quartile range
Q3 - Q1 affords a convenient measure of absolute variation. The
greater the quartile range, the greater the dispersion.

It is customary to use one-half the quartile range as a measure of
dispersion, and to it is given the name of semi-interquartile range.
We denote it by Q, and hence:

$$Q = \frac{Q_3 - Q_1}{2} \qquad (1)$$

We  can  determine  the  quartiles  in  a  manner  similar  to  that  used 
in  the  determination  of  the  median  (see  Section  25,  p.  76).  The  class 
intervals  in  which  the  quartiles  lie  are  called  the  quartile  classes. 

area $wN$ represents $N$ measures
area $\frac{wN}{4}$ represents $\frac{N}{4}$ measures

Let f1 and f3 be the frequencies of the lower and upper quartile classes.

Let b1 and b3 be the lower boundaries of these classes.

Let n1 and n3 be the accumulated frequencies of all classes below
the lower and upper quartile classes respectively.

w = the class width

N = the total frequency

z1 = b1Q1 (the distance from b1 to Q1)

Q1 = the lower quartile

[Figure 12]

Then in Figure 12 we have:

$$\text{area } ABCDQ_1 = \frac{wN}{4}$$

$$ABb_1 + b_1CDQ_1 = \frac{wN}{4}$$

$$n_1 w + f_1 z_1 = \frac{wN}{4}$$

From which:

$$z_1 = \frac{\frac{wN}{4} - n_1 w}{f_1}$$

and

$$Q_1 = b_1 + z_1 = b_1 + \frac{\frac{N}{4} - n_1}{f_1}\,w \qquad (2)$$

In a similar manner, by equating area ACMRSQ3 to $\frac{3wN}{4}$ we obtain:

$$Q_3 = b_3 + \frac{\frac{3N}{4} - n_3}{f_3}\,w \qquad (3)$$


If the median be designated by Q2, formula (4) of Section 25 (p. 77)
may be written:

$$Q_2 = b_2 + \frac{\frac{2N}{4} - n_2}{f_2}\,w$$

and the three formulas may be written in the form:

$$Q_i = b_i + \frac{\frac{iN}{4} - n_i}{f_i}\,w, \qquad i = 1, 2, 3$$

It should be noted that the determination of Q1 and Q3 requires
that we know the class boundaries of the classes that contain Q1
and Q3.

Therefore, to determine Q1 we must first locate the class that
contains Q1, the Q1 class. This done, we will then know N/4, n1,
b1, and f1. To locate the Q1 class we find N/4, then begin at the lower
end of the scale and add the frequencies of the successive classes until
the lower boundary of the class containing Q1 is reached. We then
know n1, b1, and f1, and thus can immediately find Q1. A similar
statement may be made with regard to Q3.

The quartile points may also be found by simple analysis without
using formulas just as we found the median, Q2. The method is
explained in Exercise 8 of the next appearing list of exercises.

Returning to the data of Table 8 (p. 26), for an illustrative example,
we have:

$$\frac{N}{4} = \frac{125}{4} = 31.25 \qquad \frac{3N}{4} = 93.75$$


The quartile class of Q1 is the class of 67.5-72.5, and the quartile
class for Q3 is the class 77.5-82.5. Hence:

b1 = 67.5   and   b3 = 77.5
n1 = 23     and   n3 = 84
f1 = 24     and   f3 = 19

Then:

$$Q_1 = 67.5 + \left(\frac{31.25 - 23}{24}\right) 5 = 69.22 \text{ c.u.}$$

and

$$Q_3 = 77.5 + \left(\frac{93.75 - 84}{19}\right) 5 = 80.06 \text{ c.u.}$$

This gives the quartile range to be:

Q3 - Q1 = 10.84   and   Q = 5.42 c.u.

In  other  words,  half  of  the  scores  occupy  a  range  of  10.84  on  the 
centigrade  scale,  almost  equally  distributed  on  either  side  of  the 
median.  For,  by  Section  25  (p.  78),  Md  =  Q2  =  74.60,  and  therefore: 

Afd  -  Qi  =  74.60  -  69.22  =  5.38  c.u. 
Qz-  Md^  80.06  -  74.60  =  5.46  c.u. 
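The interpolation used above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `grouped_quartile` and the list `TABLE_8` are our own, and the class boundaries are inferred from the class marks 50, 55, ..., 95 and the class width 5 used in this chapter.

```python
from typing import List, Tuple

# (lower class boundary, frequency), lowest class first.
# Boundaries are inferred from class marks 50, 55, ..., 95 and width 5.
TABLE_8: List[Tuple[float, int]] = [
    (47.5, 2), (52.5, 4), (57.5, 6), (62.5, 11), (67.5, 24),
    (72.5, 37), (77.5, 19), (82.5, 12), (87.5, 6), (92.5, 4),
]


def grouped_quartile(classes: List[Tuple[float, int]], width: float, i: int) -> float:
    """Return Q_i = b_i + ((i*N/4 - n_i) / f_i) * w for i = 1, 2, 3."""
    total = sum(freq for _, freq in classes)   # N
    target = i * total / 4                     # iN/4
    below = 0                                  # n_i, accumulated frequency
    for lower, freq in classes:
        if freq > 0 and below + freq >= target:
            return lower + (target - below) / freq * width
        below += freq
    raise ValueError("target frequency exceeds the total frequency")


q1 = grouped_quartile(TABLE_8, 5.0, 1)
q2 = grouped_quartile(TABLE_8, 5.0, 2)
q3 = grouped_quartile(TABLE_8, 5.0, 3)
semi_interquartile_range = (q3 - q1) / 2

print(round(q1, 2), round(q3, 2), round(semi_interquartile_range, 2))
```

The same function serves for the median by taking `i = 2`, which mirrors the remark that the three formulas share one form.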

As previously stated, the quartiles, and hence $Q$, are expressed in
terms of the original units, but if we divide $Q$ by $\frac{Q_3 + Q_1}{2}$ we have a
quartile coefficient of dispersion which may be used to measure relative
variation. This coefficient is a ratio, a pure number less than
unity in value, and hence in using it we may compare the variations in
distributions of unlike units, as distributions of heights in inches with
distributions of weights in kilograms. Designating the quartile
coefficient of dispersion by $V_Q$, we have:

$$V_Q = \frac{Q_3 - Q_1}{Q_3 + Q_1} \tag{4}$$

In case the distribution is symmetrical:

$$Q_3 - Md = Md - Q_1$$

from which

$$Md = \frac{Q_3 + Q_1}{2}$$

and

$$V_Q = \frac{Q_3 - Q_1}{2\,Md}$$

In this case, the distance from the median to either $Q_3$ or $Q_1$ is called
the probable deviation, sometimes loosely called the probable error. In
other words, the probable deviation is that distance which if laid off on
either side of the median of a symmetrical distribution will include 50
per cent of the measures. If the distribution is not only symmetrical
but normal ¹ (see Section 35, p. 134), this distance is properly called the
probable error.


¹ A normal distribution is one whose frequency curve is of the type $y = Ce^{-x^2/2\sigma^2}$.
Chapter 12 will be concerned with normal distributions.


EXERCISES

1. Find $V_Q$ for the distribution of college algebra grades as described
in Table 8 (p. 26). State a use of this result.

2. Find $Q$ and $V_Q$ for the distributions of heights and weights as
described in Exercise 1 (p. 54). Give meaning to your results.

3. The deciles are the points on the X-scale which divide the distribution
into ten equal parts. If $b_1, b_2, \ldots, b_9$ be the lower boundaries of the
decile classes; $f_1, f_2, \ldots, f_9$ be the frequencies of the decile classes, and
$n_1, n_2, \ldots, n_9$ be the accumulated frequencies in all classes below the
respective decile classes, and if $D_i$ be the $i$th decile, show that:

$$D_i = b_i + \left(\frac{\frac{iN}{10} - n_i}{f_i}\right) w, \qquad i = 1, 2, \ldots, 9$$

4. Find the deciles for the distributions of English scores as described in
Exercise 4, page 102.

5. Suggest some measures of absolute and relative variability based
upon the deciles.

6. The percentiles are the points on the X-scale which divide the
distribution into one hundred equal parts. If $b_1, b_2, \ldots, b_{99}$ be the lower
boundaries of the percentile classes; $f_1, f_2, \ldots, f_{99}$ the frequencies of the
percentile classes; and $n_1, n_2, \ldots, n_{99}$ the accumulated frequencies in all
classes below the respective percentile classes, and if $P_i$ be the $i$th
percentile, show that:

$$P_i = b_i + \left(\frac{\frac{iN}{100} - n_i}{f_i}\right) w, \qquad i = 1, 2, \ldots, 99$$

7. Find the fifth, the fifteenth, and the seventy-fifth percentiles for the
distribution in Exercise 2, page 54.

8. The quartile points may be determined by simple arithmetic in a
manner similar to that used in finding the median. (See p. 79.)
Complete the outline on page 120.

Consider the adjacent distribution. By counting from the smaller
X-values we determine 52.5-62.5 to be the $Q_1$ class. Below this class
are 4 + 11 = 15 scores. We need to move up the scale above 52.5 a
distance $z_1$ until we obtain 10 scores from the 32 scores of the $Q_1$ class, and
thus have 15 + 10 = 25, or $N/4$.

    Class          X       f(x)
    92.5-102.5     97.5      4
    82.5- 92.5     87.5      9
    72.5- 82.5     77.5     17
    62.5- 72.5     67.5     23
    52.5- 62.5     57.5     32
    42.5- 52.5     47.5     11
    32.5- 42.5     37.5      4
    Total                  100

    Distances     Frequencies

$$\frac{z_1}{10} = \frac{10}{32}$$

$$z_1 = (\quad)$$

$$Q_1 = 52.5 + z_1 = (\quad)$$

By the method employed here, find $Q_3$ of this distribution.

9. What are the limiting values of the earnings of the middle half
of each distribution of Exercise 12, page 74?

10. Compute $Q_1$, $Q_3$, and $Q$ for the distribution of head-breadths given
in Exercise 2, page 54. Does $Md \pm Q$ give values coincident with $Q_3$
and $Q_1$? Can you suggest a reason?

11. Find $Q_3$ and $Q_1$ for the distribution of Exercise 3, page 42. Since
this is a distribution of discrete variates, what meaning can you give to
your computed values?

12. What are the limiting values of the earnings of the middle half of
the distribution given in Exercise 31, page 108?

13. Does $Md \pm Q$ for the distribution of Exercise 12 above give values
coincident with $Q_3$ and $Q_1$? Can you suggest a reason?

14. In what units are the following constants measured: $M$, $Md$, $Mo$,
Range, $Q_1$, $Q_3$, $Q$, and $V_Q$?

15. Derive the formulas for $Q_1$ and $Q_3$ by the method of Exercise 8,
above.


33.   THE  MEAN  DEVIATION 

In the previous chapter we have shown it to be a property of the
arithmetic mean that the algebraic sum of the deviations from it is
zero. The algebraic sum of the deviations about any other measure
of central tendency will probably be small. Further, we have
emphasized in the beginning of this chapter that in measuring variability
we are interested in the amount and not in the direction of the
variation. And, too, whatever constant is used to measure variability
should be one that is based upon all the original measures.

These considerations lead us to define the mean deviation as the
mean of the absolute values ¹ of the deviations of the separate
measures from some measure of central tendency. Although the mean
deviation is a minimum when taken about the median ² (which is a
splendid argument for insisting upon its being taken about that
average), yet it is more frequently taken about the mean. If $X$
is any measure and $M$ the mean, then:

$$\text{M.D. about } M = \frac{\sum |X - M| f(x)}{N}$$

We have previously designated the deviation of any measure from
the mean by $x$ (see Figure 1, p. 73), that is:

$$x = X - M$$

hence

$$\text{M.D. about } M = \frac{\sum |x| f(x)}{N}$$

Similarly, we may define the mean deviation about the median by
the formula:

$$\text{M.D. about } Md = \frac{\sum |X - Md| f(x)}{N}$$

Of course if the numbers are not arranged in a frequency distribution,
then considering each frequency as unity we have:

$$\text{M.D. about } M = \frac{\sum |X - M|}{N} = \frac{\sum |x|}{N}$$

$$\text{M.D. about } Md = \frac{\sum |X - Md|}{N}$$

Corresponding coefficients of relative dispersion may be found by
dividing any mean deviation by the average about which it is taken.
Thus:

$$V_{\text{M.D. about } M} = \frac{\text{M.D. about } M}{M}$$

¹ The magnitude represented by a signed number is called the absolute value
or the numerical value of the number, and is indicated by placing a vertical line
on either side of the number. Thus the absolute value of + 5 and of - 5 is 5;
in symbols, | + 5 | = | - 5 | = 5.

² Yule and Kendall, op. cit., p. 145.

For an illustrative example we shall compute the mean deviation
about the mean for the distribution of the grades in college algebra.
In Section 22 (p. 62) we computed the mean to be:

$$M = 74.48 \text{ c.u.}$$

We then have: $x = X - M = X - 74.48$


Table 22. Computing M.D. about M for the Grades in College Algebra

      X      f(x)    |x| = |X - M|    |x| · f(x)
     95        4         20.52           82.08
     90        6         15.52           93.12
     85       12         10.52          126.24
     80       19          5.52          104.88
     75       37          0.52           19.24
     70       24          4.48          107.52
     65       11          9.48          104.28
     60        6         14.48           86.88
     55        4         19.48           77.92
     50        2         24.48           48.96
    Total    125                        851.12

$$\text{M.D. about } M = \frac{851.12}{125} = 6.81 \text{ c.u.}$$

$$V_{\text{M.D. about } M} = \frac{6.81}{74.48} = 0.09143 = 9.1\%$$
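The arithmetic of Table 22 can be checked with a short Python sketch. The names `MARKS`, `FREQS`, and `mean_deviation` are our own and are not part of the text; the second call, about the median 74.60 quoted from Section 25, merely illustrates the remark that the mean deviation is smallest when taken about the median.

```python
from typing import List, Optional, Tuple

# Class marks and frequencies of the college algebra grades (Table 22).
MARKS: List[float] = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
FREQS: List[int] = [4, 6, 12, 19, 37, 24, 11, 6, 4, 2]


def mean_deviation(marks: List[float], freqs: List[int],
                   center: Optional[float] = None) -> Tuple[float, float]:
    """Return (M.D., center); the center defaults to the arithmetic mean."""
    n = sum(freqs)
    if center is None:
        center = sum(x * f for x, f in zip(marks, freqs)) / n
    total_abs = sum(abs(x - center) * f for x, f in zip(marks, freqs))
    return total_abs / n, center


md_about_mean, mean = mean_deviation(MARKS, FREQS)
relative_md = md_about_mean / mean             # coefficient of relative dispersion
md_about_median, _ = mean_deviation(MARKS, FREQS, center=74.60)

print(round(md_about_mean, 2), round(relative_md, 3), round(md_about_median, 2))
```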

Of the three measures of absolute variability that we have thus
far considered, the mean deviation is the only one which has
considered the deviations of all the individual members from a given
average. The range and the semi-interquartile range are distances
that are not based upon the consideration of all the members of the
distribution. The mean deviation, however, is based upon all the
members of the group, is rigidly defined, is readily computed, and is
not difficult to comprehend. It gives due weight to the extreme items,
and is an especially good measure to use with economic data. The
artificial step of ignoring the signs of the deviations, of course,
renders it useless in further mathematical treatment.

It is a property of approximately normal distributions that the
interval

$$M \pm (\text{M.D. about } M)$$

includes about 58 per cent of the total frequency. For this
distribution of college algebra grades this interval extends from 74.5 - 6.8
to 74.5 + 6.8, that is, from 67.7 to 81.3.

[Figure: portion of the histogram of Table 8 between 67.7 and 81.3, with
areas 23, 37, and 14 over the classes bounded at 72.5, 77.5, and 82.5]

By constructing a portion of the histogram of Table 8 and recalling
that a class frequency is proportional to the area of the rectangle,
we find that

$$\frac{72.5 - 67.7}{5}(24) + 37 + \frac{81.3 - 77.5}{5}(19)$$

or

$$23 + 37 + 14 = 74$$

scores lie in this interval. This is 59 per cent of the total frequency,
125, which checks the theory approximately.
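The area argument above amounts to taking, from each class, the share of its frequency that falls inside the interval. A minimal Python sketch follows, assuming (as the histogram does) that the variates are spread uniformly within each class; the function name `frequency_in_interval` and the list `TABLE_8` are hypothetical, and the boundaries are inferred from the class marks and the width 5.

```python
from typing import List, Tuple

# (lower class boundary, frequency), lowest class first; width 5 assumed.
TABLE_8: List[Tuple[float, int]] = [
    (47.5, 2), (52.5, 4), (57.5, 6), (62.5, 11), (67.5, 24),
    (72.5, 37), (77.5, 19), (82.5, 12), (87.5, 6), (92.5, 4),
]


def frequency_in_interval(classes: List[Tuple[float, int]], width: float,
                          low: float, high: float) -> float:
    """Estimate how many variates lie between low and high."""
    count = 0.0
    for lower, freq in classes:
        upper = lower + width
        overlap = min(high, upper) - max(low, lower)   # shared length on the scale
        if overlap > 0:
            count += freq * overlap / width            # proportional share of the class
    return count


count = frequency_in_interval(TABLE_8, 5.0, 67.7, 81.3)
share = count / sum(freq for _, freq in TABLE_8)

print(round(count, 2), round(100 * share, 1))
```

The unrounded result differs slightly from the text's 74 because the text rounds each partial frequency to a whole score.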

This  example  illustrates  an  important  function  of  a  measure  of 
dispersion  when  it  is  combined  with  a  measure  of  central  tendency. 
They  give  a  summarized  description  of  the  distribution  because  they 
make  possible  the  determination  of  intervals  that  include  rather 
definite proportions of the total frequency. Thus $Md \pm Q$
determines an interval that includes about $N/2$ variates and $M \pm \text{M.D.}$
determines an interval that includes about $3N/5$ variates.


EXERCISES

1. Find $M$, $\sum |x|$, and M.D. about $M$ for each set of numbers:

    (a)                  (b)                  (c)
     X    x   |x|         X    x   |x|         X    x   |x|
      3                   62                  124
      5                   68                  146
     13                   74                  162
     20                   76                  178
     27                   88                  190
     58                   94                  220

2. Statistical data of the United States Department of Agriculture
show the following average yields in bushels per acre for the three specified
crops. Compute M.D. about $M$.

    Year   Wheat    Rye    Oats      Year   Wheat    Rye    Oats
    1923   13.3    11.3    30.5      1928   15.4    11.7    32.9
    1924   16.0    15.0    34.0      1929   13.0    11.4    29.3
    1925   12.8    11.3    31.9      1930   14.2    12.8    32.2
    1926   14.7    10.3    26.6      1931   16.3    10.4    28.1
    1927   14.7    15.1    27.1      1932   13.0    12.2    30.1

3. Complete the following table. Find M.D. about $M$. What per
cent of the total frequency is included in the interval $M \pm \text{M.D.}$?

    Class          X      f(x)    x'    x'f(x)    x    |x|f(x)
    92.5-102.5    97.5      4
    82.5- 92.5    87.5     11
    72.5- 82.5    77.5     32
    62.5- 72.5    67.5     25      0
    52.5- 62.5    57.5     15
    42.5- 52.5    47.5      8
    32.5- 42.5    37.5      5
    Total                 100

4. Each of two marksmen A and B fires 10 shots at a horizontal line XY.
Their records are indicated by the following diagrams. Basing your
conclusion upon the mean deviation, can you determine who made the better
record?

[Diagrams: A's Record and B's Record, each showing the shots about the line XY]


34.   THE  STANDARD  DEVIATION 

Unquestionably the most universally used measure of dispersion
is the standard deviation. It is usually denoted by $\sigma_x$ (sigma),¹ and
is defined as the square root of the mean of the squares of all the
individual deviations measured from the arithmetic mean. Expressed
as a formula, this definition becomes:

$$\sigma = \sqrt{\frac{\sum (X - M)^2}{N}} = \sqrt{\frac{\sum x^2}{N}} \tag{5}$$

If the original measures are grouped in a frequency distribution,
the definition becomes:

$$\sigma = \sqrt{\frac{\sum (X - M)^2 f(x)}{N}} = \sqrt{\frac{\sum x^2 f(x)}{N}} \tag{6}$$

It will be noted that the squaring of the deviations removes the
objectionable feature of signs noted in the preceding section when
discussing the mean deviation. Further, the squaring gives added
weight to the extreme measures, a desirable feature for some types of
data. It should also be noted that taking the square root of the mean
of the squared deviations leaves $\sigma$ expressed in the original unit of
measure.

Formulas (5) and (6) should be learned in several forms, thus:

$$\sigma^2 = \frac{\sum x^2 f(x)}{N}, \qquad N\sigma^2 = \sum x^2 f(x), \text{ etc.}$$

¹ We shall generally omit the subscripts, employing them only when
necessary, as in theoretical developments and for purposes of identification. [See
p. 61.]

Unless otherwise stated, the standard deviation is always computed
with the deviations measured from the arithmetic mean. This is
due to the theorem that the sum of the squares of the deviations
about $M$ is less than if taken at any other point. We shall soon
prove this theorem.

The quantity

$$\frac{\sum x^2 f(x)}{N}$$

is usually spoken of as the second moment (since each deviation is
squared) of the distribution about the mean expressed in (original
units)² and is designated by $\nu_2$ (read: nu two). Hence:

$$\sigma^2 = \nu_2$$

The quantity, $\sigma^2$, is also known as the variance of the distribution.

The computation of $\sigma$ from the definition, or formula (5), is a
decidedly simple though sometimes tedious matter. Let us consider
the familiar distribution of college algebra marks as previously
considered in Tables 8, 15, and 17. The arithmetic mean has been
found to be 74.48. The following table shows the steps involved.


Table 23. Computing σ for the Distribution of Grades in
College Algebra by the Definition   M = 74.48

      X     f(x)    x = X - M       x²          x² f(x)
     95       4       20.52      421.0704     1,684.2816
     90       6       15.52      240.8704     1,445.2224
     85      12       10.52      110.6704     1,328.0448
     80      19        5.52       30.4704       578.9376
     75      37        0.52        0.2704        10.0048
     70      24      - 4.48       20.0704       481.6896
     65      11      - 9.48       89.8704       988.5744
     60       6     - 14.48      209.6704     1,258.0224
     55       4     - 19.48      379.4704     1,517.8816
     50       2     - 24.48      599.2704     1,198.5408
    Total   125                              10,491.2000

$$\sigma^2 = \nu_2 = \frac{10491.2}{125} = 83.9296 \text{ (c.u.)}^2$$

$$\sigma = \sqrt{\nu_2} = 9.16 = 9.2 \text{ c.u. (approximately)}$$

It may frequently happen that the measures are not sufficiently
numerous to warrant their arrangement in a frequency distribution.
Thus, consider the 10 scores in centigrade units that were made
on a certain test by 10 students of algebra. The scores are given in
column one of the table. To apply formula (5) to these values we
proceed, as the table shows, to find $M$ and then $x$ corresponding to
each score. We have

$$N = 10, \qquad \sum X = 810, \qquad M = 81 \text{ c.u.}$$

$$\sum x^2 = 1374$$

$$\sigma = \sqrt{\frac{1374}{10}} = 11.7 \text{ c.u.}$$
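The steps of Table 23 translate directly into code. The following Python sketch applies the definition to the grouped college algebra grades; the variable names are our own and carry no special meaning in the text.

```python
import math
from typing import List

# Class marks and frequencies of the college algebra grades (Table 23).
MARKS: List[float] = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
FREQS: List[int] = [4, 6, 12, 19, 37, 24, 11, 6, 4, 2]

n = sum(FREQS)                                             # N
mean = sum(x * f for x, f in zip(MARKS, FREQS)) / n        # M
# Second moment about the mean: sum of x^2 f(x) divided by N.
variance = sum(f * (x - mean) ** 2 for x, f in zip(MARKS, FREQS)) / n
sigma = math.sqrt(variance)

print(round(variance, 4), round(sigma, 2))
```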


EXERCISES 

1. Statistical data of the United States Department of Agriculture show
the following average yields in bushels per acre for the three specified
crops. Compute $\sigma$ for each grain.

    Year   Wheat    Rye    Oats      Year   Wheat    Rye    Oats
    1923   13.3    11.3    30.5      1928   15.4    11.7    32.9
    1924   16.0    15.0    34.0      1929   13.0    11.4    29.3
    1925   12.8    11.3    31.9      1930   14.2    12.8    32.2
    1926   14.7    10.3    26.6      1931   16.3    10.4    28.1
    1927   14.7    15.1    27.1      1932   13.0    12.2    30.1

How many of the given 10 values are included in the interval $M \pm \sigma$?
Test for each grain.

2. a. Prove: $M_{X+A} = M_X + A$
State this theorem in words.

b. Prove: $M_{X-A} = M_X - A$
State this theorem in words.

c. Prove: $\sigma_{X+A} = \sigma_X$

d. Prove: $\sigma_{X-A} = \sigma_X$

e. Prove: $\sum [X - M]^2 = \sum X^2 - NM^2 = \sum X^2 - \frac{(\sum X)^2}{N}$

3. Compute $\sigma$ for the given distribution.

    Class         X     f(x)
    32.5-37.5    35       2
    27.5-32.5    30       8
    22.5-27.5    25      12
    17.5-22.5    20      26
    12.5-17.5    15      16
     7.5-12.5    10       6
     2.5- 7.5     5       2
    Total                72

Owing to the fact that the value $M$ usually comes out decimally,
computing $\sigma$ by formula (6) is usually laborious, even tedious, hence
we are driven to seek other methods. We shall develop two other
important methods for computing $\sigma$. The first method will express $\sigma$ in
terms of the original variates, $X$, and the second will express $\sigma$ in
terms of $x'$, deviations in class units of $X$, from the arbitrary origin.

Referring to Figure 1 (p. 73), we note that:

$$x = X - M$$

Hence:

$$\sigma^2 = \frac{\sum x^2 f(x)}{N} = \frac{\sum (X - M)^2 f(x)}{N} = \frac{\sum (X^2 - 2MX + M^2) f(x)}{N}$$

$$= \frac{\sum X^2 f(x)}{N} - \frac{2M \sum X f(x)}{N} + \frac{M^2 \sum f(x)}{N}$$

But:

$$\frac{\sum X f(x)}{N} = M \quad \text{and} \quad \sum f(x) = N$$

Therefore:

$$\sigma^2 = \frac{\sum X^2 f(x)}{N} - 2M^2 + M^2 = \frac{\sum X^2 f(x)}{N} - M^2$$

from which we obtain:

$$\sigma = \sqrt{\frac{\sum X^2 f(x)}{N} - M^2} \tag{7}$$

This formula gives a straightforward method for computing $\sigma$
when the original values of $X$ are not too large or when a table of
squares is accessible. We shall illustrate the use of the formula for
the distribution of college algebra grades. As a matter of fact this
table is a continuation of Table 15 (p. 62).


Table 24. Computing σ of the Grades in College Algebra by (7)

      X     f(x)    X f(x)     X² f(x)
     95       4       380       36,100
     90       6       540       48,600
     85      12     1,020       86,700
     80      19     1,520      121,600
     75      37     2,775      208,125
     70      24     1,680      117,600
     65      11       715       46,475
     60       6       360       21,600
     55       4       220       12,100
     50       2       100        5,000
    Total   125     9,310      703,900

$$M = \frac{9310}{125} = 74.48 \qquad M^2 = 5547.2704$$

$$\frac{\sum X^2 f(x)}{N} = \frac{703900}{125} = 5631.2$$

$$\sigma = \sqrt{5631.2 - 5547.2704} = \sqrt{83.9296} = 9.16 \text{ c.u.}$$
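Formula (7) needs only the two column totals of Table 24. The sketch below carries the subtraction out in exact rational arithmetic with Python's `fractions.Fraction`, a choice of ours intended only to show that the result agrees exactly with the 83.9296 obtained from the definition; the variable names are illustrative.

```python
import math
from fractions import Fraction
from typing import List

# Class marks and frequencies of the college algebra grades (Table 24).
MARKS: List[int] = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
FREQS: List[int] = [4, 6, 12, 19, 37, 24, 11, 6, 4, 2]

n = sum(FREQS)                                            # N = 125
sum_xf = sum(x * f for x, f in zip(MARKS, FREQS))         # sum of X f(x)
sum_x2f = sum(x * x * f for x, f in zip(MARKS, FREQS))    # sum of X^2 f(x)

mean = Fraction(sum_xf, n)
variance = Fraction(sum_x2f, n) - mean ** 2               # formula (7), squared
sigma = math.sqrt(variance)

print(sum_xf, sum_x2f, float(variance), round(sigma, 2))
```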

A third, and still more useful, method for computing $\sigma$ will now
be established. The method is analogous to that used in deriving
formula (3) of Section 24 (p. 71). From Figure 1 (p. 73) we have

$$x + wb_x = wx' \quad \text{or} \quad x = w(x' - b_x)$$

where $w$, $x'$, and $b_x$ are defined as in Section 24:

$$\sigma^2 = \nu_2 = \frac{\sum x^2 f(x)}{N} = \frac{\sum [w(x' - b_x)]^2 f(x)}{N}$$

$$\sigma^2 = w^2 \left[\frac{\sum x'^2 f(x)}{N} - \frac{2b_x \sum x' f(x)}{N} + \frac{b_x^2 \sum f(x)}{N}\right]$$

Recalling that $\frac{\sum x' f(x)}{N} = b_x$ and $\sum f(x) = N$, we have:

$$\sigma^2 = w^2 \left[\frac{\sum x'^2 f(x)}{N} - b_x^2\right] \tag{8}$$

$$\sigma = w \sqrt{\frac{\sum x'^2 f(x)}{N} - b_x^2} \tag{9}$$

Computing $\sigma$ by (9), which is based upon the class interval as a unit
of measure, we shall call the short method for computing the standard
deviation. We shall illustrate its use by computing $\sigma$ for the
distribution of the grades in college algebra. It will be noted that
Table 25 is a mere continuation of Table 17 (p. 73).

Table 25. Computing σ for the Grades in College Algebra by (9)

      X     f(x)    x' = (X - 75)/5    x' f(x)    x'² f(x)
     95       4           4               16         64
     90       6           3               18         54
     85      12           2               24         48
     80      19           1               19         19
     75      37           0                0          0
     70      24         - 1             - 24         24
     65      11         - 2             - 22         44
     60       6         - 3             - 18         54
     55       4         - 4             - 16         64
     50       2         - 5             - 10         50
    Total   125                         - 13        421

$$A = 75 \text{ (the arbitrary origin)}, \qquad w = 5, \qquad N = 125$$

$$b_x = \frac{-13}{125} = -0.104 \qquad b_x^2 = 0.010816$$

$$M = 75 + 5(-0.104) = 74.48 \text{ c.u.}$$

$$\frac{\sum x'^2 f(x)}{N} = \frac{421}{125} = 3.368$$

$$\sigma = 5\sqrt{3.368 - 0.010816} = 5\sqrt{3.357184} = 5(1.832) = 9.16 \text{ c.u.}$$

The observant student will note that in computing $\sigma$ we have the
quantities needed to compute $M$.
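The short method can be sketched as follows. The arbitrary origin 75 and the class width 5 are those of Table 25; the names `ORIGIN`, `WIDTH`, `b_x`, and `nu2_prime` are our own labels for the quantities in formulas (8) and (9).

```python
import math
from typing import List

# Class marks and frequencies of the college algebra grades (Table 25).
MARKS: List[int] = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
FREQS: List[int] = [4, 6, 12, 19, 37, 24, 11, 6, 4, 2]
ORIGIN = 75   # arbitrary origin
WIDTH = 5     # class width w

# Deviations from the origin in class units; exact because marks are 5 apart.
coded = [(x - ORIGIN) // WIDTH for x in MARKS]

n = sum(FREQS)
sum_cf = sum(c * f for c, f in zip(coded, FREQS))         # sum of x' f(x)
sum_c2f = sum(c * c * f for c, f in zip(coded, FREQS))    # sum of x'^2 f(x)

b_x = sum_cf / n                                          # mean in class units
nu2_prime = sum_c2f / n                                   # second moment about the origin
mean = ORIGIN + WIDTH * b_x
sigma = WIDTH * math.sqrt(nu2_prime - b_x ** 2)           # formula (9)

print(sum_cf, sum_c2f, round(mean, 2), round(sigma, 2))
```

Only small integers are squared here, which is the practical advantage of the short method when the work is done by hand.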

The quantity $\frac{\sum x'^2 f(x)}{N}$ is usually denoted in statistics by $\nu'_2$ (read:
nu two prime), and is called the second moment about the arbitrary
origin expressed in (class units)². Hence:

$$\sigma^2 = \nu_2 = w^2(\nu'_2 - \nu'^2_1)$$

If we write formula (8) in the form

$$N\sigma^2 = w^2 \sum x'^2 f(x) - Nw^2 b_x^2$$

or

$$\sum x^2 f(x) = \sum (wx')^2 f(x) - N(wb_x)^2$$

a careful interpretation leads to an important theorem to which
attention has previously been called. For $N\sigma^2 = \sum x^2 f(x)$ is the sum
of the squares of the deviations of the variates about the mean;
$\sum (wx')^2 f(x)$ is the sum of the squares of the deviations of the variates
about any point, and $N(wb_x)^2$ is a positive quantity. Hence the
theorem: the sum of the squares of the deviations of the variates about
the mean is less than the sum of the squares of the deviations about
any other point. [See Exercise 28 at end of chapter.]

If  dispersion  is  to  be  measured  by  the  root-mean-square  deviation 
about  some  point,  the  above  theorem  recommends  our  taking  M 
for  that  point,  for  it  is  about  M  that  the  root-mean-square  deviation 
has  a  unique  value. 
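The theorem can be checked numerically on the college algebra grades. In the Python sketch below (with names of our own choosing), the sum of squares about each trial point exceeds the sum about the mean by exactly N times the square of the distance of that point from M, which is the quantity N(wb_x)² of the text.

```python
from typing import List

# Class marks and frequencies of the college algebra grades.
MARKS: List[float] = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
FREQS: List[int] = [4, 6, 12, 19, 37, 24, 11, 6, 4, 2]

n = sum(FREQS)
mean = sum(x * f for x, f in zip(MARKS, FREQS)) / n


def sum_of_squares_about(point: float) -> float:
    """Sum of the squared deviations of the variates about the given point."""
    return sum(f * (x - point) ** 2 for x, f in zip(MARKS, FREQS))


about_mean = sum_of_squares_about(mean)
trial_points = [70.0, 74.0, 75.0, 80.0]
# Excess over the sum about the mean, to compare with N * (point - M)^2.
excesses = [sum_of_squares_about(p) - about_mean for p in trial_points]
predicted = [n * (p - mean) ** 2 for p in trial_points]

for p, e, q in zip(trial_points, excesses, predicted):
    print(p, round(e, 4), round(q, 4))
```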

The coefficient of relative dispersion based upon the standard
deviation is known as the coefficient of variation, and is defined by the
formula:

$$V_\sigma = \frac{\sigma}{M} \tag{10}$$

and is usually expressed as a percentage. That is, the variability
is expressed as a certain per cent of the mean.

A  word  of  comment  at  this  point  with  regard  to  formula  (10)  in 
particular  and  to  relative  variation  in  general  may  be  desirable.  The 
arbitrary  ratio  of  the  standard  deviation  to  the  arithmetic  mean  as  a 
measure  of  relative  variation  as  well  as  the  other  ratios  that  we  have 
used,  e.g.  formula  (4),  seems  to  be  based  more  on  psychological  than 
on  logical  grounds. 

Despite individual variation that we have noted among statistical
phenomena, we have learned from experience to formulate judgments
of the individual of normal size. That is, the establishment of norms
seems to be a natural process. We hear the expressions: "What a
large apple!" "My, isn't she tiny?" "How emaciated he is!"
"What a tremendous ear of corn!" "Wasn't that a hard rain?"
"What a hot day in May!" All these expressions imply the notion
of a norm as well as variation from a norm.

We  have  also  formed  judgments,  which  at  this  time  may  be  crude 
and  inadequate,  of  relative  variation  with  respect  to  the  norm.  Any 
student,  without  using  his  statistical  analysis,  knows  that  a  nose  one 
inch  longer  than  the  average  length  of  noses  is  more  monstrous  than 
a  height  that  is  one  inch  longer  than  the  average  of  the  heights.  In 
other  words,  a  variation  is  large  or  small  depending  upon  the  norm 
with  which  it  is  associated. 

Doubtless  such  considerations  as  the  above  led  Professor  Karl 
Pearson  to  define  the  coefficient  of  variation  as  the  ratio  of  the 
standard  deviation  to  the  arithmetic  mean.  The  arithmetic  mean 
is  taken  to  be  the  norm,  and  the  standard  deviation  measures  the 
variation  from  the  norm. 

We should develop a statistical alertness to relative variation in
characters that are less familiar. Thus, we have found for a
distribution of weights of college men that $M = 138.9$ lbs. We shall
find that $\sigma = 17.2$ lbs., and hence $V_\sigma = 17.2/138.9 = 0.124 = 12.4\%$.
That is, for a group of weights of young men, the standard deviation
is about 12.5 per cent, or one-eighth, of the mean. The heights of
these same men will give $M = 67.9$ in. and $\sigma = 2.4$ in., and hence
$V_\sigma = 2.4/67.9 = 0.035 = 3.5\%$. That is, for a group of heights
of young men the standard deviation is about 3.5 per cent of the
mean. A distribution of weights, then, shows much more variation
than a distribution of heights.
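Because the coefficient of variation is a pure number, the comparison of pounds with inches reduces to comparing two ratios. A brief Python sketch, using the means and standard deviations quoted above and a hypothetical helper `coefficient_of_variation`:

```python
def coefficient_of_variation(sigma: float, mean: float) -> float:
    """Return sigma / M, the coefficient of variation of formula (10)."""
    if mean == 0:
        raise ValueError("the coefficient is undefined when the mean is zero")
    return sigma / mean


# Values quoted in the text for the same group of college men.
v_weights = coefficient_of_variation(17.2, 138.9)   # pounds
v_heights = coefficient_of_variation(2.4, 67.9)     # inches

print(round(100 * v_weights, 1), round(100 * v_heights, 1))
```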

The  general  literature  of  biometry  records  coefficients  of  variation 
for  many  characters.   We  present  herewith  a  few  of  them. 


    Character                      V_σ       Character                V_σ
    Visual acuity                 39.12      Pulse rate per min.     14.89
    Wt. of heart (unhealthy)      32.39      Chest circumference      8.45
    Grip, right hand              25.93      Length of forearm        5.24
    Wt. of heart (healthy)        17.71      Length of foot           4.59
                                             Stature (English)        3.99

Economic data generally show a much larger variation than do
biometric data. (Of course the coefficients of variation for much
economic data will not remain constant but will vary from time to
time.) The weekly earnings of 72,000 Illinois coal miners were
analyzed. The analysis gave $M = \$8.37$, $\sigma = \$2.49$, and $V_\sigma = 29.7$.
An analysis of the price of potatoes gave $M = 54.4$ cents, $\sigma = 11.11$
cents, and $V_\sigma = 19$. The variation in economic phenomena will be
especially considered in Chapter 6 on Index Numbers.


EXERCISES

1. What norm was used in the development of the formula for $V_\sigma$?

2. Compute $\sigma$ for the earnings of each group of Exercise 12, p. 74.
The earnings of which group show the greater dispersion?

3. Compute $\sigma$ for the distribution of the salaries of federal employees
that is given in Exercise 36, p. 109. Is it possible to apply formula (9) to
this distribution?

4. Compute $\sigma$ for the distribution of the annual wages of chief wage
earners that is given in Exercise 31, p. 108. What is the coefficient of
variation for this distribution?

5. Table A gives the I.Q.'s of 905 school children. Table B gives the
weights of 1,000 school children. For each distribution find: $M$, $Md$,
$Mo$, $\sigma$. Does the interval $M \pm 3\sigma$ include all the variates of each
distribution?


    Table A                 Table B

      X       f(x)            X      f(x)
     60.5       3            29.5       1
     70.5      21            33.5      14
     80.5      78            37.5      56
     90.5     182            41.5     172
    100.5     305            45.5     245
    110.5     209            49.5     263
    120.5      81            53.5     156
    130.5      21            57.5      67
    140.5       5            61.5      23
                             65.5       3
    Total     905            Total   1000

6. Derive formula (9) from formula (7).


35.   THE NORMAL CURVE ¹

In Section 19 (p. 51) reference was made to the normal distribution
and to the general form of the equation that represents it. This
curve is so important in statistical work, both theoretical and applied,
that, although we discuss it rather fully in Chapter 12, we desire at
this point to call attention to some of its properties. The general form
of the curve is shown by curve (b) of Chart 8 and by the curves (a),
(b), and (c) of Section 30 (p. 111). The normal curve is characterized
by the symmetrical arrangement of all the variates with respect to a
line through the central value, most of the observations lying close
to the mean and very few differing from it considerably.

The normal curve is of importance to us just now in that its
properties will assist us in making certain generalizations about distributions
that do not differ too markedly from normality. And such
distributions are not at all rare. Measurements of natural objects (such
as the lengths of the leaves on a tree, the heights of men, the lengths
of bean pods, the breadths of the heads of men, the lengths and
breadths of nuts) distribute themselves with a surprising closeness to
normality if large samples are taken. In an approximately normal
distribution of a thousand observations we can estimate with
surprising accuracy the number that differ from the mean by definite
amounts, say $\sigma$, $2\sigma$, $3\sigma$, etc. In fact these relations are so regular
with the measurements of natural objects that those which are so
distributed are said to be normal. As has been previously noted,
many data collected from the fields of psychology and education are
also of this type.

Chart 9 shows a normal curve. The mean, median, and mode
coincide at 0. It has a maximum at the center and is symmetrical
with respect to the vertical line through 0. The curve crosses its
tangent, that is, the curve changes from concave to convex, at
points $I_1$ and $I_2$ which are at a distance $\sigma$ from the vertical through
0. The curve approaches the X-axis as $x$ gets large, though we
seldom extend it beyond $3\sigma$ in either direction from 0 because the
number of such deviations outside $M \pm 3\sigma$ is relatively insignificant.

We have laid off certain multiples of $\sigma$ on either side of the mean.
It will be proved in Chapter 12 that:

¹ If the reader desires to know more about the normal curve, its history and
importance, he should read Section 101.

[Chart 9. A Normal Curve. Horizontal scales: $x$ from $-2\sigma$ to $2\sigma$ about 0, and
$X$ from $M - 2\sigma$ to $M + 2\sigma$ about $M$]

The interval from $M - \sigma$ to $M + \sigma$
includes approximately $\frac{2}{3}N$.

The interval from $M - 2\sigma$ to $M + 2\sigma$
includes approximately 95 per cent of $N$.

The interval from $M - 3\sigma$ to $M + 3\sigma$
includes approximately 99 per cent of $N$.

Further, it will be shown that:

The range equals $6\sigma$ approximately.

$Q$ equals $\frac{2}{3}\sigma$ approximately.

M.D. from $M$ equals $\frac{4}{5}\sigma$ approximately.

Of course as an observed distribution departs from normality, the
approximations are less close.
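These approximations can be compared with values computed from the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is ours, and the ratio of the mean deviation to σ uses the standard result that it equals the square root of 2/π for a normal distribution.

```python
import math
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1


def share_within(k: float) -> float:
    """Fraction of a normal distribution lying within M - k*sigma and M + k*sigma."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_one = share_within(1)      # compare with 2/3
within_two = share_within(2)      # compare with 95 per cent
within_three = share_within(3)    # compare with 99 per cent

# Q / sigma: the distance from the median that encloses the middle half.
q_over_sigma = standard_normal.inv_cdf(0.75)
# M.D. / sigma for a normal distribution.
md_over_sigma = math.sqrt(2 / math.pi)

print(round(within_one, 4), round(within_two, 4), round(within_three, 4))
print(round(q_over_sigma, 4), round(md_over_sigma, 4))
```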

The  number  of  units  of  a  that  must  be  laid  off  on  either  side 
of  Af  of  a  normal  distribution  to  include  the  total  frequency,  N, 
varies  with  N,  If  N  is  very  large,  more  than  =h  3(r  is  necessary 
whereas  if  A''  is  small  less  than  db  3(r  is  needed.  The  following  table 
gives  the  interval  that  includes  N  for  a  normal  distribution. 

N          Interval                N          Interval

10         $M \pm 1.65\sigma$      200        $M \pm 2.81\sigma$
20         $M \pm 1.96\sigma$      500        $M \pm 3.0\sigma$
30         $M \pm 2.13\sigma$      1,000      $M \pm 3.3\sigma$
50         $M \pm 2.33\sigma$      10,000     $M \pm 3.9\sigma$
100        $M \pm 2.58\sigma$      100,000    $M \pm 4.4\sigma$
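
The text does not state how the tabulated multiples were obtained. One criterion that reproduces the entries for $N$ up to 200 closely is that, on the average, a single variate out of $N$ falls outside the interval. The sketch below, in Python, rests on that assumption (it is ours, not the authors'); the function name `half_width_in_sigmas` is hypothetical.

```python
from statistics import NormalDist


def half_width_in_sigmas(n: int) -> float:
    """Multiple k of sigma such that, on the average, one variate in n
    lies outside M ± k*sigma (an assumed criterion, not stated in the text)."""
    tail = 1.0 / (2 * n)  # expected outside fraction 1/n, split over two tails
    return NormalDist().inv_cdf(1.0 - tail)


for n in (10, 20, 30, 50, 100, 200, 500, 1_000, 10_000, 100_000):
    print(f"N = {n:>7,}:  M ± {half_width_in_sigmas(n):.2f}σ")
```

The entries for the larger values of $N$ are printed to one decimal only, so agreement there can be expected only roughly.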

For  the  distribution  of  college  algebra  grades  we  have  found: 

$M = 74.48$ c.u.          $M - \sigma = 65.32$ c.u.

$\sigma = 9.16$ c.u.          $M + \sigma = 83.64$ c.u.

How  many  of  the  125  grades  of  the  sample  lie  in  this  interval  from 
65.32  to  83.64? 

To  assist  us  in  answering  the  question  let  us  construct  the  histogram 
for  the  central  portion  of  Table  8  (p.  26). 


Figure 13. [Histogram of the central portion of Table 8. The horizontal axis is marked at 65, 67.5, 70, 75, 80, 82.5, and 85, with the points 65.32 and 83.64 indicated.]


The interval evidently includes the total frequencies of the three
central groups (24 + 37 + 19 = 80), and an undetermined part of
the classes designated by the class marks 65 and 85. From 65.32
to 67.5 is 2.18, and since we assume the variates to be uniformly
distributed over the class interval, we must include (2.18/5)11 = 4.79.
Similarly, from 82.5 to 83.64 is 1.14, and hence we must include
(1.14/5)12 = 2.73. Hence the interval from 65.32 to 83.64 includes
80 + 4.79 + 2.73 = 87.52, or about 70 per cent of the 125 variates.
The result here is more than $\tfrac{2}{3}N$ for the reason that our
distribution is loaded at 75.
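
The interpolation can be sketched in a few lines of Python. This is only an illustration: the function name `count_in_interval` is ours, and the assignment of the frequencies 11, 24, 37, 19, 12 to the class marks 65, 70, 75, 80, 85 is inferred from the arithmetic in the text rather than read from Table 8. The code keeps full precision, so its total differs slightly in the last place from the figures in the text, which are cut to two decimals.

```python
def count_in_interval(class_marks, frequencies, width, low, high):
    """Estimated number of variates between low and high, assuming the
    variates of each class are spread uniformly over the class interval."""
    total = 0.0
    for mark, freq in zip(class_marks, frequencies):
        lower, upper = mark - width / 2, mark + width / 2
        # Length of the part of this class that lies inside [low, high].
        overlap = max(0.0, min(high, upper) - max(low, lower))
        total += freq * overlap / width
    return total


marks = [65, 70, 75, 80, 85]     # class marks (width 5)
freqs = [11, 24, 37, 19, 12]     # frequencies inferred from the text
included = count_in_interval(marks, freqs, 5, 65.32, 83.64)
print(f"{included:.2f} variates, or {100 * included / 125:.1f} per cent of 125")
```
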

When dealing with a distribution of discrete variates, interpolation
is usually not necessary. For example, consider the distribution
given in Exercise 3 (p. 42). We have previously computed for this
distribution:

$M = 53.67$ c.u.
$\sigma = 2.16$
$M - \sigma = 51.51$
$M + \sigma = 55.83$

The interval from 51.51 to 55.83 includes the frequencies with class
marks at 52, 53, 54, and 55, that is, a total of 468 (= 96 + 134 + 127
+ 111), or 66.57 per cent of $N$.

36.   THE  PROBABLE  ERROR 

A measure of dispersion that particularly relates to the normal
curve is the probable error,¹ $E_x$. It is a distance which, when laid
off on either side of the mean of a normal curve, defines an interval
that includes one half the total area under the curve. Stated somewhat
differently, the probable error of a distribution of variates
normally distributed is that deviation on either side of the mean
within which half the variates lie. Then, since half the total
frequency lies within the interval $M_x - E_x$ to $M_x + E_x$, it is an even
chance that a variate selected at random falls within this interval.

The  following  figure  may  assist  in  clarifying  the  probable  error 
concept.  This  figure  shows  the  per  cent  of  the  total  frequency  that 
is  included  by  the  indicated  probable  error  units. 

The  probable  error  is  closely  related  to  the  standard  deviation. 
The  relationship  is  indicated  by  the  equation 

$$E_x = 0.6745\,\sigma_x \tag{11}$$

Approximately, then, $E_x$ is about $\tfrac{2}{3}\sigma_x$ and $\sigma_x$ is about $\tfrac{3}{2}E_x$. If a
distribution is not normal, it is customary to define the probable error
by (11).
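
The factor 0.6745 in (11) is the deviation, in units of $\sigma$, that leaves one quarter of the area of a normal curve in each tail. A minimal check in Python, assuming the standard library's `statistics.NormalDist`; the variable names are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, sigma 1

# The upper quartile of the standard normal gives E in units of sigma.
probable_error_factor = z.inv_cdf(0.75)

# Area under the curve between -0.6745 sigma and +0.6745 sigma.
area_within = z.cdf(0.6745) - z.cdf(-0.6745)

print(f"E / σ = {probable_error_factor:.4f}")
print(f"area within ±0.6745σ = {area_within:.4f}")
```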

Since any multiple of $\sigma$ can be expressed in terms of $E$ and vice
versa, it is natural to inquire why we have both and what are the
advantages of $E$. The standard deviation $\sigma$, or a synonym for it,
is the older measure. The early nineteenth-century astronomers,
particularly Bessel in 1815 and Gauss in 1816, who were among
the first men to work with statistical analysis, desired an interval
within and outside of which it is equally probable that a random
measurement of a normal distribution will occur. This interval on
the X-scale is from $M - E$ to $M + E$, or from $M - 0.6745\sigma$ to
$M + 0.6745\sigma$. Bessel first used the term probable error; Gauss
and the contemporary writers liked the term, and so tradition has
kept it in use to this day.

1 Generally we shall omit the subscript, employing it only for identifications.

Of course there is a facility of language when using the probable
error that may account for its popularity. For example, it is an
"even chance," a "fifty-fifty chance," or a "one-to-one shot" that
a measure selected at random from a normal distribution falls within
or without $M \pm E$. In other words it is as "likely as not" that a
measure selected at random from a group of normally distributed
variates will fall within the interval $M \pm E$. Equally simple
language does not obtain when using the standard deviation. Thus,
assume a distribution of the heights of 1,500 men with $M = 67.5$ in.,
$\sigma = 2.5$ in., and $E = 0.6745(2.5) = 1.7$ in. Then if a measure is
selected at random from this group it is as likely as not to fall within
the interval $67.5 \pm 1.7$ inches.

EXERCISES

1. The data of the following tables are taken from Bulletin No. 620
of the U.S. Department of Labor, "Wages, Hours, and Working Conditions
in the Folding-Paper-Box Industry, 1933, 1934, 1935." They present
the hourly earnings of employees in the U.S. in the paper-box industry.

Compute $M$, $\sigma$, and $E$ for each table.


X                 May, 1933    August, 1934    August, 1935

10 a.u. 15             57              1               0
15 a.u. 20            168              2               8
20 a.u. 25            538              9              19
25 a.u. 30            507             20              96
30 a.u. 35            622            231             286
35 a.u. 40            541           1371            1332
40 a.u. 45            485           1834            1670
45 a.u. 50            357            919             969
50 a.u. 55            328            729             806
55 a.u. 60            172            484             563
60 a.u. 70            327            739             758
70 a.u. 80            211            420             457
80 a.u. 100           174            533             555
100 a.u. 120           42            224             264
120 a.u. 150           17             85              82

Total                4546           7601            7865

2.  Let  the  five  numbers  3,  4,  5,  6,  7  be  a  universe.  Select  different 
samples  of  three  from  these  five  numbers,  10  samples  in  all,  and  compute 
their  means.  Thus 

$$M_1 = \frac{3+4+5}{3}, \quad M_2 = \frac{3+4+6}{3}, \quad M_3 = \frac{3+4+7}{3}, \quad \ldots, \quad M_{10} = \frac{5+6+7}{3}$$

a.  Find  the  mean  of  the  10  sample  means.  How  does  it  compare  with 
the  mean  of  the  universe? 

b.  Find  the  standard  deviation  of  the  10  sample  means.  How  does  it 
compare  with  the  standard  deviation  of  the  universe? 

3.  Consider  the  universe  of  numbers  5,  10,  15,  20,  25.  Treat  these 
numbers  as  you  did  those  of  Exercise  2. 

4. The problem of sampling has been called by Karl Pearson the
fundamental problem in statistics. Often our only statistical knowledge
of the parent population is obtained from a study of samples drawn from it.
Stated in rather general terms, the question is: how well does the sample
describe the parent population? More precisely, the problem is: how do
$M$ and $\sigma$ computed from the sample compare with $M$ and $\sigma$ computed
from the parent population?

Table 26(a) presents 10 sample distributions of 100 each. The
parent population consists of 1,000 individuals. (By assigning the
sample distributions to various members of the class the computational
labor will be greatly lightened.)

(a) Find $M$ and $\sigma$ for each sample and for the total.

(b) Find the standard deviation of the 10 means of the samples.

(c) How many of the means are included in the interval (Mean of means)
$\pm$ (standard deviation of means)?

(d) Compare the mean of the 10 means with the mean of the total.


Table 26(a). Distribution of the Weights of 1,000 Male Students

(Measurements to nearest pound)

Class Mark                       Frequencies
(Pounds)  1st  2nd  3rd  4th  5th  6th  7th  8th  9th 10th   Total
          100  100  100  100  100  100  100  100  100  100

 95.25    1, 1 (Total 2)
105.25    2, 5, 2, 4, 1, 3, 3, 1 (Total 21)
115.25     11   11   13    9   10    9    9   14    9    9     104
125.25     14   24   15   21   19   25   20   14   21   23     196
135.25     26   16   31   19   31   21   30   22   27   25     248
145.25     23   17   16   26   21   19   17   21   18   19     197
155.25     11   16   11   11   13   16   12   17   11   15     133
165.25      2    5    3    5    4    5    5    5    9    4      47
175.25      5    1    5    0    1    1    2    4    2    4      25
185.25    3, 3, 1, 4, 0, 1, 1, 1 (Total 14)
195.25    1, 1, 1, 1, 0, 2, 1 (Total 7)
205.25    0, 1, 1, 1, 1 (Total 4)
215.25    0 (Total 0)
225.25    0 (Total 0)
235.25    1 (Total 1)
245.25    1 (Total 1)

Total     100  100  100  100  100  100  100  100  100  100    1000

$M$

$\sigma$

[Rows written with commas have blank cells in the original. Their entries are listed in left-to-right order, but the columns to which they belong are not recoverable here.]

5. Treat the data in Table 26(b) as you did those of Table 26(a).


Table 26(b). Distribution of the Heights of 1,000 Male Students

(Measurements to nearest [fraction illegible] inch)

Class Mark                       Frequencies
(Inches)  1st  2nd  3rd  4th  5th  6th  7th  8th  9th 10th   Total
          100  100  100  100  100  100  100  100  100  100

59.45     1 (Total 1)
60.45     1, 0 (Total 1)
61.45     1, 0, 2, 2, 2 (Total 7)
62.45     3, 3, 1, 2, 1, 5, 2, 1 (Total 18)
63.45       6    6    4    5    3    3    0    2    3    1      33
64.45       6    4    9    3    6   11    3    6    7    8      63
65.45      11    7   13   14   10    9   10    6    9    8      97
66.45      12   12   11   13   17   11   19   12   18   12     137
67.45      13   23   17   22   13   10   14   14   13   16     155
68.45      12   15   20   15   20   16   22   26   17   17     180
69.45      14    8    4    9   11   13   12   15   14   22     122
70.45       9    8    5    4   11    5    4    6   10    6      68
71.45       7    8    8    6    4    7    9    7    3    6      65
72.45       2    2    3    4    2    5    1    2    4    3      28
73.45       1    1    2    0    2    3    1    1    2    1      14
74.45     1, 1, 3, 0, 1, 0 (Total 6)
75.45     0, 0, 0, 1, 1 (Total 2)
76.45     1, 1, 1 (Total 3)

Total     100  100  100  100  100  100  100  100  100  100    1000

$M$

$\sigma$

[Rows written with commas have blank cells in the original. Their entries are listed in left-to-right order, but the columns to which they belong are not recoverable here.]

6.  Find  the  standard  deviation  of  the  ten  standard  deviations  of 
Table  26(a),  Exercise  4.  Which  shows  the  greater  dispersion,  the  sample 
means  or  the  sample  standard  deviations? 

7.  Find  the  standard  deviation  of  the  ten  standard  deviations  of 
Table  26(b),  Exercise  5.  Which  shows  the  greater  dispersion,  the  sample 
means  or  the  sample  standard  deviations? 

37.   THE SIGNIFICANCE OF THE MEAN AND THE STANDARD DEVIATION

Thus  far  our  statistical  analysis  of  a  given  group  has  enabled  us  to 
abstract  certain  qualities  of  the  group.  The  most  important  of  these 
qualities  are:  (1)  the  central  or  typical  condition  of  the  group,  and 
(2) the degree of variability of the members of the group. The
central condition can be obtained from the appropriate measure of
central tendency, and the degree of variability from the appropriate
measure of dispersion, preferably the standard deviation and the
coefficient of variation. For example, important facts of the
distribution of college algebra marks are:

$M = 74.48$ c.u.          $\sigma = 9.16$ c.u.

These  summarizing  constants  contain  the  kernel  of  the  distribution. 
They  give  a  fairly  complete  numerical  description  of  the  sample. 

Now what statistical judgments concerning the parent population,
which consists of all the grades in college algebra that are recorded
at Bucknell University, can one form from the examination of the
sample?* We do not desire to answer this question completely at
this point, as Chapter 13 is devoted entirely to the problem that we
raise here, but we may appropriately state certain facts, which must
at this time be accepted without proof.

If  the  sample  that  we  have  been  considering  was  a  random  one  — 
that  is,  if  any  mark  in  college  algebra  had  the  same  chance  of  being 
selected  a  member  of  our  sample  as  any  other  mark  —  we  may 
expect  that  if  another  sample  were  selected  its  mean  and  its  standard 
deviation  would  differ  but  slightly  from  those  we  have  computed. 
Furthermore,  we  may  expect  the  true  mean  and  the  true  standard 
deviation  of  the  parent  population  to  differ  but  little  from  those  of 
the  sample. 

It  has  become  customary,  therefore,  for  statisticians  who  desire  to 
make  statistical  estimates  of  the  parent  population  from  an  analysis  of  a 
sample  to  record  the  results  in  such  a  manner  that  a  definite  range  of 
variation  about  the  estimated  measure  is  determined.  The  limits 
of  the  definite  range  of  variation  about  an  estimated  value  are 
established  in  such  a  way  that  we  can  state  the  probability  that  the 
known  value  (mean,  standard  deviation,  etc.)  found  from  the  sample 
does  not  differ  more  than  a  determinate  amount  from  the  unknown 
and  generally  unknowable  true  values  of  similar  constants  of  the 
parent  population  or  universe.  This  determinate  amount  is  generally 
a standard deviation or a probable error. The computed value
derived from the sample is known, whereas the estimated value
belonging to the universe is generally not known. The computed value
is the basis of our estimate and our task is to measure the reliability
of the estimate. This measure of reliability is expressed in terms
of chance or probability. Our method is to find the determinate
amount and state the probability that the known value diverges
by no more than this amount from the true value.

* As a matter of fact 125 measurements constitute far too small a sample for
purposes of generalization. We use it here merely for illustrative purposes.

As an example, what can we say about the mean of the universe,
$M_u$, of college algebra marks from our analysis of the sample? We
must first compute the determinate amount, the probable error of
the mean, $E_M$, then interpret the result, where

$$E_M = 0.6745\,\frac{\text{Standard deviation of sample}}{\sqrt{\text{Number in the sample}}}$$

or, in brief,

$$E_M = 0.6745\,\frac{\sigma}{\sqrt{N}}$$

For the problem under discussion

$$E_M = 0.6745\,\frac{9.16}{\sqrt{125}} = 0.55 \text{ c.u.}$$

and, as is customary, we write

$M_u$ = Mean of universe = $74.48 \pm 0.55$ c.u.

which, translated into English, reads "74.48 with a probable error of
0.55." This means that the chances are even that the sample mean,
74.48 c.u., does not differ more than 0.55 c.u. from the true mean,
$M_u$. Note that we do not say, "The chances are even that the true
mean $M_u$ does not differ more than 0.55 from the sample mean 74.48."
$M_u$ is a fixed value; it is not a variable as the quotation implies.
The sample means, however, are variable. This is an important
distinction.
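
The arithmetic for the probable error of the mean can be reproduced directly. A minimal sketch in Python, using the sample values quoted in the text ($\sigma = 9.16$ c.u., $N = 125$); the function name `probable_error_of_mean` is ours.

```python
import math


def probable_error_of_mean(sigma: float, n: int) -> float:
    """E_M = 0.6745 * sigma / sqrt(N)."""
    return 0.6745 * sigma / math.sqrt(n)


e_m = probable_error_of_mean(9.16, 125)
print(f"E_M = {e_m:.2f} c.u.")
print(f"M_u = 74.48 ± {e_m:.2f} c.u.")
```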

Doubtless, the two preceding paragraphs look formidable. Suppose
we now try to make understandable what we have said. Our
first sample of 125 scores chosen at random gave a mean of 74.48 c.u.
and a standard deviation of 9.16 c.u. Another sample of 125 scores
chosen in a similar manner would probably yield slightly different
results. In other words, these so-called statistical constants show
variation as we move from sample to sample.

We continue this sampling process until we have a large number
of sample means, sample standard deviations, etc. This large number
of sample means may be formed into a distribution of means.
The distribution of means has its mean, $M_M$; its standard deviation,
the standard deviation¹ of the means, $\sigma_M$; and its probable error,
the probable error of the means, $E_M$.

This  distribution  of  means  has  some  remarkable  properties: 

1. It is a normal distribution.

2. Its mean, $M_M$, is equal to the mean of the universe, i.e. $M_M = M_u$.

3. Its standard deviation is given by $\sigma_M = \dfrac{\sigma}{\sqrt{N}}$.

4. Its probable error is given by $E_M = 0.6745\,\sigma_M = 0.6745\,\dfrac{\sigma}{\sqrt{N}}$.

5. Two thirds of the sample means are included in the interval $M_M \pm \sigma_M$
or in $M_u \pm \sigma_M$.

6. One half of the sample means are included in the interval $M_M \pm E_M$
or in $M_u \pm E_M$. Thus, the probable error of the mean, $E_M$, is a value
such that the chances are even that a sample mean lies within the
interval, $M_u \pm E_M$, or outside the interval.

7. Practically all the sample means are included in the interval
$M_M \pm 3\sigma_M$ or $M_u \pm 3\sigma_M$.
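
These properties are stated without proof, but they can be watched at work in a small sampling experiment. The sketch below, in Python, is an illustration under stated assumptions: the universe is taken to be exactly normal with the mean and standard deviation of the algebra sample, 2,000 samples of 125 are drawn, and the seed is fixed only so that the run is repeatable. None of these choices comes from the text.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so that the run is repeatable

# Hypothetical universe: normal, with the mean and sigma of the algebra sample.
M_U, SIGMA_U = 74.48, 9.16
N = 125          # size of each sample
SAMPLES = 2000   # number of samples drawn (an arbitrary choice)

sample_means = [
    statistics.fmean(random.gauss(M_U, SIGMA_U) for _ in range(N))
    for _ in range(SAMPLES)
]

sigma_m = SIGMA_U / math.sqrt(N)  # property 3
e_m = 0.6745 * sigma_m            # property 4

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
share_within_e = sum(abs(m - M_U) <= e_m for m in sample_means) / SAMPLES
share_within_sigma = sum(abs(m - M_U) <= sigma_m for m in sample_means) / SAMPLES

print(f"mean of means   = {mean_of_means:.2f}  (universe mean {M_U})")
print(f"s.d. of means   = {sd_of_means:.3f}  (sigma/sqrt(N) = {sigma_m:.3f})")
print(f"within M_u ± E_M     : {share_within_e:.3f}")
print(f"within M_u ± sigma_M : {share_within_sigma:.3f}")
```

The observed shares should fall near one half and two thirds respectively, with small departures that shrink as the number of samples is increased.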

The student will observe from the third property that, even for
a reasonably large sample, the distribution of means is rather
concentrated. Thus for $N = 100$, $\sigma_M = \sigma/10$, and $\pm 3\sigma_M = \pm 3\sigma/10$,
which is a relatively small range. So if $N$ is large, a sample mean $M$
is an excellent estimate of $M_u$. The little variation in the distribution
of means shows that the mean is a stable measure of central
tendency. Its stability is illustrated by the rather narrow normal
curve of Figure 15.

Similarly, the sample standard deviations may be formed into
a distribution of standard deviations. While this distribution is
not exactly normal, for large values of $N$, if the samples are taken
from a normal universe, it does not differ a great deal from normality.
It has its mean, $M_\sigma$, its standard deviation,² $\sigma_\sigma$, and its probable
error, $E_\sigma$.

1 The standard deviation of the mean is frequently called the standard error
of the mean.

2 The standard deviation of $\sigma$ is frequently called the standard error of $\sigma$.


Figure 15. [Two curves: the curve of means of samples, for which $\sigma_M = \sigma/\sqrt{N}$ approximately, and the curve of the parent distribution or universe, for which $\sigma_u = \sigma$ approximately.]

The sample standard deviations are distributed almost symmetrically
about $M_\sigma$, which is approximately equal to the standard
deviation of the universe $\sigma_u$, and with a measurable variation.
We can measure the variation of the sample $\sigma$'s by $\sigma_\sigma$ or by $E_\sigma$.
Formulas for evaluating them are the following:

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2N}}$$

$$E_\sigma = 0.6745\,\sigma_\sigma = 0.6745\,\frac{\sigma}{\sqrt{2N}} = 0.707\,E_M = 0.4769\,\sigma_M$$

The probable error of the standard deviation, $E_\sigma$, is a value such
that the chances are even that a sample $\sigma$ will lie within the interval
$\sigma_u \pm E_\sigma$, or outside the interval. As an example illustrating its use,
what can we say about the standard deviation of the universe, $\sigma_u$,
of college algebra marks from our analysis of the sample? We find

$$E_\sigma = 0.6745\,\frac{9.16}{\sqrt{250}} = 0.39 \text{ c.u.}$$

or from

$$E_\sigma = 0.707\,E_M = 0.707(0.55) = 0.39 \text{ c.u.}$$

and, as is customary, we write

$$\sigma_u = 9.16 \pm 0.39 \text{ c.u.}$$

which we read "9.16 with a probable error of 0.39." This means
that the chances are even that the sample $\sigma$, 9.16 c.u., does not
differ more than 0.39 c.u. from the true standard deviation, $\sigma_u$.
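
Both routes to the probable error of the standard deviation can be checked side by side. A minimal sketch in Python using the sample values quoted in the text; the function name `probable_error_of_sigma` is ours.

```python
import math


def probable_error_of_sigma(sigma: float, n: int) -> float:
    """E_sigma = 0.6745 * sigma / sqrt(2N)."""
    return 0.6745 * sigma / math.sqrt(2 * n)


sigma, n = 9.16, 125
e_sigma = probable_error_of_sigma(sigma, n)

# Second route: E_sigma = E_M / sqrt(2), i.e. about 0.707 E_M.
e_m = 0.6745 * sigma / math.sqrt(n)
e_sigma_from_e_m = e_m / math.sqrt(2)

print(f"E_sigma = {e_sigma:.2f} c.u. (directly)")
print(f"E_sigma = {e_sigma_from_e_m:.2f} c.u. (from E_M)")
```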

If the student would prefer to use a "two-to-one-chance" language,
he may do so by using as the measures of variation $\sigma_M$ and
$\sigma_\sigma$. This is quite a matter of taste and about tastes we do not
wish to argue.

As an illustrative example, again we consider the distribution of
college algebra marks. We have

$$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{9.16}{\sqrt{125}} = 0.82 \text{ c.u.}$$

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2N}} = 0.707\,\frac{\sigma}{\sqrt{N}} = (0.707)(0.82) = 0.57 \text{ c.u.}$$

Thus, the odds are two to one that a sample mean will not differ
more than 0.82 c.u. from the mean of the universe, $M_u$. Or, about
two thirds of all sample means are included in the interval $M_u \pm \sigma_M$.

Similarly, the odds are two to one that a sample standard deviation
will not differ more than 0.57 c.u. from the standard deviation of the
universe, $\sigma_u$. That is, about two thirds of the sample standard
deviations are included in the interval $\sigma_u \pm \sigma_\sigma$.

EXERCISES

1. Compute $\sigma$ and M.D. from $M$ for the distribution of Exercise 2,
page 54.

2. Compute $\sigma$ for the distribution of Exercise 1, page 41. How many
measures are included by the interval $M \pm \sigma$? Does $M \pm 3\sigma$ include the
entire group?

3. Find $\sigma$ for the theoretical distribution of Exercise 2, page 42. How
many measures are included by the interval $M \pm 2\sigma$?

4. Consider the distributions of heights and weights given in Exercise 1,
page 54. Which distribution has the greater dispersion?

5. Compute $\sigma$ and $V_\sigma$ for the distributions (a) and (b) of scores in
English found in Exercise 4, page 102.

6. Find $\sigma$ and $V_\sigma$ for the distributions of the measurements of eggs
found in Exercise 15, page 105. Compare these results with those obtained
in Exercise 5.

7. a. Show that the standard deviation of the first $N$ integers is given
by the equation:

$$\sigma = \sqrt{\frac{N^2 - 1}{12}}$$

b. Find $\sigma$ for the first $N$ odd integers.

8. If $N_1$, $M_1$, and $\sigma_1$ are the frequency, mean, and standard deviation
for one group of measures and $N_2$, $M_2$, and $\sigma_2$ for a second group, show that
the standard deviation of the group formed by combining the two groups
is given by:

$$\sigma^2 = \frac{N_1\sigma_1^2 + N_2\sigma_2^2}{N} + \frac{N_1 N_2}{N^2}\,(M_1 - M_2)^2$$

where

$N = N_1 + N_2$, the total frequency of the combined groups, and
$\sigma$ = the standard deviation of the combined groups.

Hint: See Exercise 7 on page 74.

9. Apply the result of the preceding exercise to find the $\sigma$ for the
distribution given in Exercise 8, page 103.

10. At a university 1,000 students were given an objective test. The
distribution of marks was closely normal. The analysis gave $M = 72$,
$\sigma$ = [illegible]. What were the approximate values of $Q$, $Q_1$, $Q_3$, M.D. from $M$,
$Mo$? Find $E_x$, $E_M$, $E_\sigma$, and interpret them.

11. Compute $E_M$ and $E_\sigma$ for the distributions of Exercise 1, page 54.
Interpret them.

12. Compute $E_M$ and $E_\sigma$ for the distribution of Exercise 2, page 54.
Interpret these values.

13. In a paper, "Experiment and Statistics in the Selection of Employees,"
in the Journal of American Statistical Association, March 1923,
p. 605, Mr. Harry A. Wembridge has presented data that show the points
scored on a mental test by 290 prospective employees and the per cent
of standard production attained by these same 290 persons after being
employed.

The results are:

Scores on Test            Per cent Production

$N = 290$                 $N = 290$
$M_1 = 42.33$             $M_2 = 92.02$
$\sigma_1 = 9.25$         $\sigma_2 = 24.47$

Compare the variability in mental ability with that of productive ability.

14. Find $E_M$ for the data in Number 13 above and interpret the results.

15. The analysis of an approximately normal distribution of weekly
salaries of 300 men gave: $M = \$60.00$ and $\sigma = \$10$.

(1)  About  how  many  received  salaries  between  $50  and  $70? 

(2)  About  how  many  received  salaries  between  $40  and  $80? 

(3)  Approximately,  what  were  the  largest  and  the  smallest  salaries 
received? 

16. The analysis of two approximately normal distributions of the
weekly salaries of 300 men each gave:

1st distribution          2nd distribution

$M_1 = \$35.00$           $M_2 = \$60.00$
$Md = \$34.00$            $Md = \$58.00$
$\sigma_1 = \$7.00$       $\sigma_2 = \$10.00$

Relatively, which distribution shows the greater dispersion?

17. Distributions of the heights and weights of 1,500 college men were
analyzed with the following results:

Heights                   Weights

$N = 1500$                $N = 1500$
$M_1 = 67.5$ inches       $M_2 = 135.4$ pounds
$\sigma_1 = 2.5$ inches   $\sigma_2 = 15.2$ pounds

Which distribution shows the greater dispersion?

18. From the statistical summaries given in Number 17, assuming the
distributions were approximately normal, what are some conclusions that
may safely be drawn?

19. Prove: $M_{AX} = A M_X$. Illustrate.

20. Prove: $M_{AX+B} = A M_X + B$. Illustrate.

21. Prove: $\sigma_{AX} = A \sigma_X$. Illustrate.

22. Prove: $\sigma_{AX+B} = A \sigma_X$. Illustrate.

23. Prove: $\sum (y - ax) = 0$.

24. Prove: $\sum (y - ax)^2 = N(a^2\sigma_x^2 + \sigma_y^2) - 2a\sum xy$.

25. Prove: $\sum X \cdot \sum Y$ does not equal $\sum XY$.

26. Prove: $\sum (Y - mX - b) = N(M_Y - mM_X - b)$.

27. Supposing that the frequencies of the X-values are the terms of the
binomial expansion $(q + p)^n$ as indicated in the table, find $\sigma_x$ if $(q + p) = 1$.
Hint: complete the table as shown and recall that $\sum X f(X) = np$.
[See Exercise 42, p. 110.]

X        f(X)                                 X(X - 1)f(X)

0        $q^n$
1        $n q^{n-1} p$
2        $\frac{n(n-1)}{2}\, q^{n-2} p^2$
...
n

28. On the X-axis are $N$ fixed points $X_1, X_2, \ldots, X_N$ and an unknown
point $X$. Find $X$ so that $\sum_{i=1}^{N} (X_i - X)^2$ is a minimum. Compare with
theorem on page 131.

29. In the result of Exercise 27 above, substitute $n = 10$, $p = q = 1/2$,
and find $\sigma$. Compare your result with that found in Exercise 3 of this set.

SKEWNESS: EXCESS: MOMENTS

38.  INTRODUCTION 

The two preceding chapters have been concerned with characterizing
masses of numerical data by means of certain summarizing
numbers.  These  summarizing  numbers  have  in  general  been  well- 
defined  statistical  constants  that  were  designed  to  measure  central 
tendency  and  dispersion.  With  the  computation  of  these  constants 
the  distribution  has  been  partially  characterized  and  described. 

When  we  say  of  a  distribution  of  heights,  for  example,  that  it  shows 
a  mean  of  67.5  inches  and  a  standard  deviation  of  2.5  inches,  we  know 
that  approximately  two-thirds  of  the  total  frequency  is  found  within 
the  interval  65  to  70;  that  it  is  extremely  unlikely  that  any  member 
of the distribution will be found without the limits $67.5 \pm 3(2.5)$;
that the total range is about 6(2.5) inches. If other summarizing
numbers such as $Q_1$, $Q_3$, $Mo$, etc. are given, then our knowledge of the
distribution  is  considerably  enlarged.  The  main  purpose  of  these 
summarizing  numbers  is  to  assist  us  in  comprehending  the  important 
features  of  a  distribution  though  the  distribution  may  not  be  present 
before  us. 

39.  THE  MEANING  OF  SKEWNESS 

Our confidence in the conclusions mentioned in the preceding section
is especially strengthened by the knowledge that the distributions
of heights of men chosen at random are fairly symmetrical. However,
we can conceive of a city police force constituted of men at least
65 inches in height. The symmetry of such a selected group
would be greatly disturbed by the selectivity, and the range
of values greater than the mean would be longer than the range of
values less than the mean. This characteristic feature of lack of
symmetry in distributions is usually called skewness or asymmetry.

In the preceding chapter emphasis was placed upon the fact that
dispersion is concerned with the amount of the variation rather than
with its direction. We feel the need for a statistical constant which
will summarize the direction of the variation or the departure from
symmetry. And just as we found it advisable to measure dispersion
for purposes of comparison by measures of relative variability, so for
purposes of comparison we must invent measures of relative skewness.
Owing to the fact that skewness is dependent upon the amount of
dispersion, the coefficients of relative skewness are obtained by
dividing the absolute skewness by some measure of absolute dispersion.
This method will result in ratios or abstract numbers which are
independent of the units in which the original variates are measured.


40.   THE  MEASUREMENT  OF  SKEWNESS 

It is an obvious fact that in unimodal symmetrical distributions
the mean, the median, and the mode coincide. Also in symmetrical
distributions the numerical distances from the median to the lower
and upper quartiles are equal, and certain pairs of deciles are
equidistant from the median. As the distribution departs from
symmetry there is a separation of the three measures of central tendency,
the difference between the mean and the mode being greatest. Also
skewness is indicated when the distances from the median to the
quartiles become unequal, and when pairs of deciles are not
equidistant from the median. Evidently any of these differences can
be made the bases for measurements of skewness.

Since the mean and the median are pulled away from the mode
in the direction of the skew, or the tail of the curve representing the
extreme measures, an evident measure of absolute skewness could
be taken to be $M - Mo$. Professor Karl Pearson has used this as
the basis for his formula for relative skewness, namely:

$$Sk = \frac{M - Mo}{\sigma} \tag{1}$$

If the mean is to the right of the mode,¹ that is, if $M > Mo$, as in
curve A, the skewness is positive, whereas if the mean is to the left
of the mode, that is, $M < Mo$, as in curve C, the skewness is negative.
If the mean and the mode coincide, as in curves B and D, the skewness
is zero.

1 See Figures 16 and 17.

The formula (1), known as Pearson's formula, is open to the objection
that in many distributions there is no well-defined mode. Since
in many distributions the approximate relation

$$M - Mo = 3(M - Md)$$

has been found to obtain, this relation suggests the use of the alternative
Pearson form:

$$Sk = \frac{3(M - Md)}{\sigma} \tag{2}$$
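
Both Pearson forms are easily computed once $M$, $Md$, $Mo$, and $\sigma$ are known. The sketch below, in Python, is an illustration only: the ten scores are hypothetical, the function name `pearson_skewness` is ours, and $\sigma$ is taken with divisor $N$ (`statistics.pstdev`).

```python
import statistics


def pearson_skewness(data):
    """Return (Sk by formula (1), Sk by formula (2)) for ungrouped data."""
    m = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # standard deviation with divisor N
    sk_mode = (m - statistics.mode(data)) / sigma
    sk_median = 3 * (m - statistics.median(data)) / sigma
    return sk_mode, sk_median


# Hypothetical scores with a long tail to the right.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
sk_mode, sk_median = pearson_skewness(scores)
print(f"Sk by (1) = {sk_mode:.3f}")
print(f"Sk by (2) = {sk_median:.3f}")
```

Both coefficients are positive here because the mean lies to the right of the mode and the median. The two forms need not agree numerically, since the relation $M - Mo = 3(M - Md)$ is only approximate.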


Figure 16. [Frequency curves A and B; vertical axis $f(x)$.]

Curve A shows positive skewness while curve B is symmetrical.

Since  in  measuring  skewness  we  are  interested  in  the  degree  of 
asymmetry  a  coefficient  of  skewness  is  always  an  index  that  may  be 
used  to  compare  the  unsymmetrical  distribution  with  a  symmetrical 
one  that  we  superimpose.  Thus,  in  Figure  16  we  may  consider  A 
as the frequency curve for a given distribution and B as the symmetrical
curve that is drawn to display the skewness in A. We indicate on the
figure that the area bounded by the curves A and B and the X-axis causes
the skewness in A. We can make a similar statement about the curves
shown in Figure 17.

Figure 17. Curve C shows negative skewness while curve D is symmetrical.


Let us be more specific and consider the four distributions of
Table 27. The student should verify the statistical constants
pertaining to each distribution. We have drawn histograms of these
distributions (see page 155) and on them we have located the points
which mark the position of the arithmetic mean and the median,
and the distance which indicates the value of the standard deviation.
The coefficients of skewness, since they are pure numbers or indexes,
cannot of course be shown on the graphs.

The reader should be warned that coefficients of skewness, like
all relative numbers, may not mean much until he has had a considerable
experience with many and varied distributions. Only by drawing the
histograms, marking on them the points for M and Md, and the distance
for σ, then computing Sk by any of our formulas and comparing the
results for several distributions (the more the better) will these
values take on a real meaning.

It has been shown¹ that (M - Md)/σ lies between -1 and +1,
and thus skewness computed by formula (2) is always between -3
and +3. This measure of skewness is obviously quite sensitive.
While it is dangerous to set limits on such indexes, we may say, as
a rough measuring stick, numerical values of skewness computed
by (2) less than 0.25 may be considered small, numerical values
between 0.25 and 0.5 as moderate, and numerical values greater
than 0.5 as large. Numerical values as large as 1 are unusual.

¹ Harold Hotelling and Leonard M. Solomons, Annals of Mathematical
Statistics, May 1932, pp. 141-142.

Table 27

                        A        B        C        D
Class          X      f(X)     f(X)     f(X)     f(X)
87.5-92.5     90        0        4        0        4
82.5-87.5     85       12        4        4        8
77.5-82.5     80       24       20       40       20
72.5-77.5     75       28       44       24       24
67.5-72.5     70       24       20       20       40
62.5-67.5     65       12        4        8        4
57.5-62.5     60        0        4        4        0

N                     100      100      100      100
M                      75       75       75       75
Md                     75       75    76.25    73.75
σ                       6        6        6        6
Sk by (2)               0        0   -0.625   +0.625
α3                      0        0     -1.2     +1.2
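
The constants of Table 27 can be checked mechanically. The following is only a sketch, not part of the original text: the function names are ours, and the median is assumed to be found by linear interpolation within the class that contains the (N/2)th item, which reproduces the tabulated values of Md.

```python
from math import sqrt

# Class mid-points (class width 5) and frequencies of Table 27, lowest class first.
X = [60, 65, 70, 75, 80, 85, 90]
TABLE_27 = {
    "A": [0, 12, 24, 28, 24, 12, 0],
    "B": [4, 4, 20, 44, 20, 4, 4],
    "C": [4, 8, 20, 24, 40, 4, 0],
    "D": [0, 4, 40, 24, 20, 8, 4],
}


def grouped_median(mids, freqs, width):
    """Median by linear interpolation inside the class holding the (N/2)th item."""
    half = sum(freqs) / 2
    cum = 0
    for mid, f in zip(mids, freqs):
        if f > 0 and cum + f >= half:
            lower_boundary = mid - width / 2
            return lower_boundary + (half - cum) / f * width
        cum += f
    raise ValueError("empty distribution")


def pearson_skewness(mids, freqs, width):
    """Return (M, Md, sigma, Sk) with Sk computed by formula (2)."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(mids, freqs)) / n
    sigma = sqrt(sum(f * (x - mean) ** 2 for x, f in zip(mids, freqs)) / n)
    median = grouped_median(mids, freqs, width)
    return mean, median, sigma, 3 * (mean - median) / sigma


for name, freqs in TABLE_27.items():
    m, md, s, sk = pearson_skewness(X, freqs, 5)
    print(f"{name}: M={m:.2f}  Md={md:.2f}  sigma={s:.2f}  Sk={sk:+.3f}")
```

Run on distributions C and D, the sketch gives Sk = -0.625 and +0.625, both "large" by the rough measuring stick above.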

Exercise. a. For each of the following distributions compute M,
Md, σ, and Sk.

b. Draw the histogram and indicate upon it the points M, Md,
and the distance σ.

            A        B        C        D
  X       f(X)     f(X)     f(X)     f(X)
 35         4       16        2        1
 30        12       48        4        2
 25        20       12        8        3
 20        28       10       10        4
 15        20        8       12       15
 10        12        4       48       25
  5         4        2       16       50
Total     100      100      100      100
Figure 18. Histograms for distributions A, B, C, and D of Table 27, each
drawn on an X-axis running from 57.5 to 92.5 with the positions of M and
Md marked.
A third measure of skewness that has become well known is that
due to Bowley. It is based upon the fact previously mentioned that
in an asymmetrical distribution the numerical distances from the
median to the lower and the upper quartiles are unequal. If q1 and
q2 are defined by the equations

$$q_1 = \text{Md} - Q_1 \quad \text{and} \quad q_2 = Q_3 - \text{Md}$$

then:

$$Sk = \frac{q_2 - q_1}{q_2 + q_1} = \frac{Q_3 + Q_1 - 2\,\text{Md}}{Q_3 - Q_1} \qquad (3)$$

In regard to formula (3), Bowley says:

If the curve is symmetrical, q2 = q1, and Sk = 0; if q2 > q1, Sk is
positive, and if q2 < q1, Sk is negative. Sk becomes +1 if q1 = 0, that is, if
the median and lower quartile coincide; and Sk becomes -1 if q2 = 0.
Sk is therefore a measurement which never exceeds 1 numerically, and has
a definite significance at zero and at its extreme values. . . . The
significance of the various values can only be obtained by experience, but it may
be suggested that 0.1 is a moderate degree of skewness, and 0.3 a
considerable degree.¹

The  quartile  measure  of  skewness  is  rigidly  defined,  is  simple  to 
compute, and is easily understood. It is a pure number, and the
restriction of its value to the small interval from -1 to +1 leaves it
sufficiently sensitive for many needs. A just criticism is that it
fails  to  take  into  consideration  the  size  of  the  extreme  variations. 
Since  the  main  question  in  skewness  is  the  determination  of  how 
much  more  the  items  deviate  on  one  side  of  the  mean  than  on  the 
other,  the  ideal  measure  of  skewness  should  give  due  emphasis  to 
the  extreme  variations. 
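
As an illustration of formula (3), the sketch below computes Bowley's coefficient for distributions A, C, and D of Table 27. It is an assumption on our part that the quartiles are located by the same within-class interpolation used for the median; the function names are hypothetical and not from the text.

```python
def grouped_quantile(mids, freqs, width, p):
    """Value below which a fraction p of the items lies (interpolated within its class)."""
    target = p * sum(freqs)
    cum = 0
    for mid, f in zip(mids, freqs):
        if f > 0 and cum + f >= target:
            return (mid - width / 2) + (target - cum) / f * width
        cum += f
    raise ValueError("empty distribution")


def bowley_skewness(mids, freqs, width):
    """Quartile coefficient of skewness, formula (3): (q2 - q1) / (q2 + q1)."""
    q_low = grouped_quantile(mids, freqs, width, 0.25)   # Q1
    md = grouped_quantile(mids, freqs, width, 0.50)      # Md
    q_up = grouped_quantile(mids, freqs, width, 0.75)    # Q3
    q1, q2 = md - q_low, q_up - md
    return (q2 - q1) / (q2 + q1)


# Class mid-points and frequencies from Table 27, lowest class first.
X = [60, 65, 70, 75, 80, 85, 90]
A = [0, 12, 24, 28, 24, 12, 0]
C = [4, 8, 20, 24, 40, 4, 0]
D = [0, 4, 40, 24, 20, 8, 4]

for name, freqs in (("A", A), ("C", C), ("D", D)):
    print(name, round(bowley_skewness(X, freqs, 5), 4))
```

Because only the quartiles and the median enter the formula, the four items in the lowest class of C have no special influence on the result, which is the criticism raised above.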

¹ Bowley, op. cit.

Many of the objections to the previously mentioned methods for
measuring skewness may be met by returning to a consideration of
the deviations of the variates from their mean. Since we are
interested in how the variates are situated with respect to the mean
and since we wish to give emphasis to the extreme measures, we
require some function of the form

$$\sum x^n f(x)$$

for some value of n. Now if n is even, we obtain the amount and
not the direction of the variation. In order to secure the direction of
the variation, we are compelled to use odd numbers for n. If n = 1,
Σx f(x) = 0. If n = 3, we obtain Σx³f(x), a basic factor in our next
measure for skewness. This measure, known as α3 (read: alpha three),
is defined as the third moment of the distribution about the mean
divided by the cube of the standard deviation, or by the equation:

$$\alpha_3 = \frac{\sum x^3 f(x) / N}{\sigma^3}$$

that is

$$\alpha_3 = \frac{\nu_3}{\sigma^3} = \frac{\text{the third moment about } M}{\text{cube of the standard deviation}} = \frac{\text{nu three}}{\text{sigma cube}} \qquad (4)$$


In what follows we shall consider α3 as the preferable measure of
skewness.¹

As an illustrative problem we shall compute α3 for the distribution
of grades in college algebra. The table will be a continuation of
Table 23, page 126.

Table 28. Computing α3 for the Distribution of Grades
in College Algebra by the Definition (M = 74.48)

  X    f(X)       x         x²f(x)           x³f(x)
 95      4      20.52     1,684.2816      34,561.458432
 90      6      15.52     1,445.2224      22,429.851648
 85     12      10.52     1,328.0448      13,971.031296
 80     19       5.52       578.9376       3,195.735552
 75     37       0.52        10.0048           5.202496
 70     24      -4.48       481.6896      -2,157.969408
 65     11      -9.48       988.5744      -9,371.685312
 60      6     -14.48     1,258.0224     -18,216.164352
 55      4     -19.48     1,517.8816     -29,568.333568
 50      2     -24.48     1,198.5408     -29,340.278784
Total  125               10,491.2000     -14,491.152000

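$$\nu_2 = \frac{\sum x^2 f(x)}{N} = \frac{10491.2}{125} = 83.9296 \text{ (c.u.)}^2$$

$$\sigma = 9.16 \text{ c.u.}, \qquad \sigma^3 = 768.795136 \text{ (c.u.)}^3$$

$$\nu_3 = \frac{\sum x^3 f(x)}{N} = \frac{-14491.152}{125} = -115.929216 \text{ (c.u.)}^3$$

$$\alpha_3 = \frac{\nu_3}{\sigma^3} = \frac{-115.929216}{768.795136} = -0.1508$$

¹ α3 is zero for the normal curve. See page 405.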

α3 is a very refined measure of skewness. The process of cubing
maintains the proper signs for the deviations and also gives emphasis
to the extreme variates. Further, the division by σ³ reduces the
measure to an abstract number. Hence it is a coefficient of relative
skewness and is independent of the unit of measure. Since it is not
restricted in its range, it is a very sensitive measure, the sensitiveness
being emphasized by the cubing of the deviations. Its chief
disadvantage is the apparent labor of computing it. We shall greatly
overcome this apparent trouble in Section 44 (p. 164) by developing
a short method.
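
The labor of Table 28 is easily delegated to a machine. The sketch below applies definition (4) directly to the grades distribution; the function name is ours. It carries σ to full precision, whereas the hand computation rounds σ to 9.16, so the two agree only to about four decimal places.

```python
# Mid-points X and frequencies f(X) of the grades distribution in Table 28.
X = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
f = [4, 6, 12, 19, 37, 24, 11, 6, 4, 2]


def alpha3_by_definition(values, freqs):
    """Return (M, nu2, nu3, alpha3) using deviations x = X - M from the mean."""
    n = sum(freqs)
    mean = sum(v * w for v, w in zip(values, freqs)) / n
    nu2 = sum(w * (v - mean) ** 2 for v, w in zip(values, freqs)) / n
    nu3 = sum(w * (v - mean) ** 3 for v, w in zip(values, freqs)) / n
    return mean, nu2, nu3, nu3 / nu2 ** 1.5   # sigma cubed equals nu2 ** 1.5


M, nu2, nu3, a3 = alpha3_by_definition(X, f)
print(f"M = {M:.2f}, nu2 = {nu2:.4f}, nu3 = {nu3:.6f}, alpha3 = {a3:.4f}")
```

The negative sign indicates that the longer tail of the grades distribution lies toward the low grades.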


41.   EXCESS  OR  KURTOSIS 

In  elementary  statistics  a  distribution  is  usually  satisfactorily 
characterized  by  the  measures  of  central  tendency,  the  measures  of 
dispersion, and the measures of skewness, or more briefly, by M,
σ, and α3. We may add one other important constant to the
summarized description by considering the relative number of the
variates  in  the  immediate  neighborhood  of  the  mean  or  mode.  This 
measure  of  relative  flatness  (or  peakedness)  of  a  curve  fitted  to  the 
distribution  as  compared  with  that  of  the  normal  curve  fitted  to  the 
same  distribution  is  called  a  measure  of  excess  or  kurtosis. 

The excess or kurtosis is measured by:

$$\kappa = \alpha_4 - 3 = \frac{\nu_4}{\sigma^4} - 3 = \frac{\sum x^4 f(x) / N}{\sigma^4} - 3 \qquad (5)$$

Again the normal curve is our standard for comparison. Since
α4 = 3 for the normal curve (see page 405) the excess for any other
curve is merely a comparison of its α4 with that of the normal curve
which has the same standard deviation.

If the excess is positive (leptokurtic), the number of variates near
the mean is greater than in a normal distribution. If the excess is
negative (platykurtic), the curve is more flat-topped than the
corresponding normal frequency curve. The normal curve, in which
α4 = 3, is said to be mesokurtic.

Figure  19,  exhibiting  three  curves  with  the  same  mean  and  the 
same  standard  deviation,  illustrates  graphically  the  meaning  of  excess. 


Figure 19. Curve A is platykurtic and α4 - 3 < 0; curve B is mesokurtic
(normal), and α4 - 3 = 0; and curve C is leptokurtic and α4 - 3 > 0.
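
A numerical counterpart to Figure 19 may help. Distributions A and B of Table 27 have the same mean (75) and the same standard deviation (6), yet differ in peakedness. The sketch below, with a function name of our own choosing, applies formula (5) to each.

```python
def excess(values, freqs):
    """Excess alpha4 - 3, with alpha4 the fourth moment about the mean over sigma^4."""
    n = sum(freqs)
    mean = sum(v * w for v, w in zip(values, freqs)) / n
    nu2 = sum(w * (v - mean) ** 2 for v, w in zip(values, freqs)) / n
    nu4 = sum(w * (v - mean) ** 4 for v, w in zip(values, freqs)) / n
    return nu4 / nu2 ** 2 - 3


# Mid-points and frequencies of distributions A and B in Table 27, lowest class first.
X = [60, 65, 70, 75, 80, 85, 90]
A = [0, 12, 24, 28, 24, 12, 0]
B = [4, 4, 20, 44, 20, 4, 4]

print("A:", round(excess(X, A), 4))   # negative: flatter than the normal curve
print("B:", round(excess(X, B), 4))   # positive: more peaked than the normal curve
```

Distribution A, with its items spread evenly over the middle classes, is platykurtic, while B, with 44 items in the central class and a few in the extreme classes, is leptokurtic.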


42.   THE  UNADJUSTED  MOMENTS  OF  A  DISTRIBUTION 

In  the  preceding  chapters  we  have  several  times  mentioned  the 
term  moment.  It  is  a  concept  so  important  in  statistical  analysis 
that  we  cannot  longer  defer  its  more  complete  consideration.  We 
shall  soon  learn  that  a  statistical  distribution  —  which  we  have 
characterized by its mean, its standard deviation, its skewness, its
excess  —  is,  in  brief,  characterized  by  its  moments. 

Further,  the  notion  of  moments  serves  as  a  guide  in  curve-fitting. 
It was remarked in Section 16 (p. 38) that the total area under a
frequency curve should equal the area of the histogram, which is
another  way  of  saying  that  the  total  frequency  should  be  unchanged. 
As  the  total  frequency  is  the  zeroth  moment,  this  is  equivalent  to 
requiring  that  the  zeroth  moment  of  the  frequency  curve  equal  the 
zeroth  moment  of  the  given  distribution.  In  like  manner,  we  may 
require  that  a  sufficient  number  of  successive  moments  of  higher 
orders  of  the  frequency  curve  be  equal  to  the  corresponding  higher 
moments of the given distribution. This is the so-called principle
of moments¹ for the determination of the parameters in a curve
which  is  to  be  selected  to  represent  a  given  distribution.  We  shall 
have  an  opportunity  to  observe  an  application  of  the  principle  of 
moments  in  Chapters  7,  12,  and  13. 

The  moments  of  a  distribution  can  be  computed  about  any  point 
at  pleasure.  They  can  be  expressed  in  various  units.  The  most 
significant moments are referred to the mean, M, and are usually
expressed either in the given or the class unit. They may then be
reduced to abstract numbers by dividing by the appropriate powers
of σ as was done, for example, in defining α3 and α4. As illustrations,
we have learned that if x equals the deviation of any frequency
from M expressed in the given unit, then:

$$\nu_1 = \frac{\sum x f(x)}{N} = \text{the 1st moment of the distribution about } M \text{ expressed in the given unit} = 0$$

$$\nu_2 = \frac{\sum x^2 f(x)}{N} = \text{the 2nd moment of the distribution about } M \text{ expressed in the (given unit)}^2 = \sigma^2$$

$$\nu_3 = \frac{\sum x^3 f(x)}{N} = \text{the 3rd moment of the distribution about } M \text{ expressed in the (given unit)}^3$$

$$\nu_4 = \frac{\sum x^4 f(x)}{N} = \text{the 4th moment of the distribution about } M \text{ expressed in the (given unit)}^4$$

etc.

Hence in general we define:

$$\nu_n = \frac{\sum x^n f(x)}{N} = \text{the nth moment of the distribution about } M \text{ expressed in the (given unit)}^n \qquad (6)$$

If n = 0, we have:

$$\nu_0 = \frac{\sum f(x)}{N} = \frac{N}{N} = 1$$

¹ Rietz and others, op. cit., p. 68.
In our previous discussion, not only have we encountered moments
about M but about other points as well. Formula (7), page 9,
involves, for example:

$$\frac{\sum X f(X)}{N} = \text{the 1st moment about zero} = M$$

and

$$\frac{\sum X^2 f(X)}{N} = \text{the 2nd moment about zero.}$$

The higher moments about zero are similarly defined.

In computing the standard deviation we noted that the arithmetical
computations were simplified by referring the variates to some point
near the mean and expressing them in class units (see Table 25,
p. 130). We shall soon discover that this transformation to class
units is especially useful when computing the higher moments.
In computing α3 (Table 28, p. 157) we felt a need for some short
method.

On pages 72 and 130 we have noted that if x' equals the deviation
of any frequency from the assumed origin O'(h, 0) expressed in the
class unit, then:

$$\nu_1' = \frac{\sum x' f(x)}{N} = \text{the 1st moment of the distribution about } O' \text{ expressed in the class unit} = b_x$$

$$\nu_2' = \frac{\sum x'^2 f(x)}{N} = \text{the 2nd moment of the distribution about } O' \text{ expressed in the (class unit)}^2$$

Hence in general we may define:

$$\nu_n' = \frac{\sum x'^n f(x)}{N} = \text{the nth moment of the distribution about } O' \text{ expressed in the (class unit)}^n \qquad (7)$$

If w = the class width, x' class units = wx' given units, and
hence:

$$\nu_n' = \frac{\sum (w x')^n f(x)}{N} = \frac{w^n \sum x'^n f(x)}{N} = \text{the nth moment about } O' \text{ expressed in the (given unit)}^n$$

Therefore we have the theorem: The nth moment of a given distribution
about any point O' in the nth power of the given unit equals w^n
times its nth moment about O' in the nth power of the class unit. Or
in short:

$$\nu_n' \text{ in the (given unit)}^n = w^n \, \nu_n' \text{ in the (class unit)}^n \qquad (8)$$
While the most significant moments are those computed about the
mean, their computation directly from the definition is very
tedious, owing to the fact that M usually involves several decimals,
and hence

x = X - M

also involves several decimals. Raising these decimals to the third,
fourth, and higher powers is laborious, even with the aid of a
calculating machine. However, just as we were able to avoid this
tedium with the short method for computing σ (Table 25, p. 130),
so we shall avoid it in computing the higher moments.
From Figure 1 (p. 73) we have noted that:

x = wx' - wb_x = w(x' - b_x) expressed in the given unit

Hence:

$$\nu_2 = \frac{\sum x^2 f(x)}{N} = \frac{\sum w^2 (x' - b_x)^2 f(x)}{N} = \frac{w^2 \sum (x'^2 - 2 b_x x' + b_x^2) f(x)}{N} = w^2(\nu_2' - b_x^2)$$

In like manner, it follows that:

$$\nu_3 = w^3(\nu_3' - 3\nu_2' b_x + 2 b_x^3)$$

$$\nu_4 = w^4(\nu_4' - 4\nu_3' b_x + 6\nu_2' b_x^2 - 3 b_x^4) \qquad (9)$$

etc.
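
The transfer formulas (9) can be verified on a small case. The sketch below is ours, not the text's: it uses distribution C of Table 27 with a deliberately off-center assumed origin h = 80 and class width w = 5, computes the moments about O' in class units, transfers them to the mean, and prints the result. The function names are hypothetical.

```python
def moments_about_origin(x_prime, freqs):
    """First four moments about the assumed origin O', in class units."""
    n = sum(freqs)
    return [sum(f * xp ** k for xp, f in zip(x_prime, freqs)) / n
            for k in (1, 2, 3, 4)]


def transfer_to_mean(nu1p, nu2p, nu3p, nu4p, w=1):
    """Equations (9): moments about the mean, in given units, with b_x = nu1p."""
    b = nu1p
    nu2 = w ** 2 * (nu2p - b ** 2)
    nu3 = w ** 3 * (nu3p - 3 * nu2p * b + 2 * b ** 3)
    nu4 = w ** 4 * (nu4p - 4 * nu3p * b + 6 * nu2p * b ** 2 - 3 * b ** 4)
    return nu2, nu3, nu4


# Distribution C of Table 27, lowest class first.
X = [60, 65, 70, 75, 80, 85, 90]
C = [4, 8, 20, 24, 40, 4, 0]
h, w = 80, 5                                  # assumed origin and class width
x_prime = [(x - h) // w for x in X]           # -4, -3, ..., 2 in class units

nu2, nu3, nu4 = transfer_to_mean(*moments_about_origin(x_prime, C), w=w)
print(nu2, nu3, nu4)
```

Here b_x = -1, and the transferred values agree with the moments obtained directly from deviations about M = 75, namely 36, -150, and 3600, without any decimal deviations having been raised to a power.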

These moments described in (9) of course express the ν's in given
units. If the class interval is taken as the unit, which is usually the
case, w = 1, and then the moments are expressed in class units.
If h = 0, νn' becomes the nth moment about zero as origin.

We have found it desirable to express M and σ in terms of the
given unit. However the third, fourth, and higher moments are
usually expressed as ratios in such a manner that they are independent
of the unit of measure. This was accomplished in defining α3 and
α4 by dividing ν3 and ν4 by σ³ and σ⁴ respectively.¹ Thus:

$$\alpha_3 = \frac{\nu_3}{\sigma^3}, \qquad \alpha_4 = \frac{\nu_4}{\sigma^4}$$

and in general

$$\alpha_n = \frac{\nu_n}{\sigma^n}$$

In particular we note that:

α1 = 0 and α2 = 1

The moments

$$\nu_n = \frac{\sum x^n f(x)}{N} \quad \text{and} \quad \nu_n' = \frac{\sum x'^n f(x)}{N}$$

are frequently called the crude or unadjusted moments about the mean
and assumed point respectively. The standard deviation, the skewness,
and the excess, if based upon them, are called the unadjusted
standard deviation, the unadjusted skewness, etc.

¹ αn = νn/σ^n is the nth moment about M expressed in (standard units)^n. A
variate is expressed in standard units by dividing its deviation from M by σ. It
is usually indicated by t. Thus, t_i = (X_i - M)/σ = x_i/σ.


43.   THE ADJUSTED MOMENTS: SHEPPARD'S CORRECTIONS

In  arranging  our  data  in  a  frequency  distribution  we  have  assumed 
that  the  items  in  a  given  class  were  concentrated  at  its  mid-point. 
This procedure introduces a slight error, which we call a grouping
error.  By  a  process  too  abstruse  for  consideration  here,  certain 
corrections  —  known  as  Sheppard's  Corrections  —  have  been  devised 
to  assist  in  correcting  the  errors  in  the  moments  due  to  grouping. 
When  applied  to  the  crude  moments  they  give  the  adjusted  moments 
of the distribution. It is quite customary to denote the adjusted
moments by μn (read: mu enn), n = 1, 2, 3, etc. They find their
widest  application  in  fitting  a  frequency  function  to  a  distribution 
of  observed  measurements  by  what  is  known  as  the  method  of 
moments. 

The  adjusted  moments  are  not  generally  recommended  for  use  in 
unrefined statistical analysis. Especially is this true if the original
data  were  not  taken  with  sufficient  accuracy  to  warrant  our  using  the 
niceties  of  analysis  that  are  implied  in  the  corrections,  for  certainly 
we  should  not  adopt  methods  in  computation  that  are  inconsistent 
with  the  data  at  hand.  A  more  potent  reason  for  our  failure  to 
recommend their employment generally is due to the fact that an
intelligent use of them requires a knowledge of their development.¹
Finished  statisticians  use  them  with  care  and  discrimination.  We 
do  not  wish  to  discourage  their  use  when  the  data  warrant  it  and 
when  they  can  be  employed  with  safety  and  confidence,  but  we  do 
insist  that  they  should  be  used  with  understanding.  We  mention 
them  here  to  add  completeness  to  our  text,  to  illustrate  the  method 
of  computing  them  to  the  student  who  may  continue  the  study  of 
statistical  analysis  beyond  this  introductory  text,  and  to  caution 
the  reader  against  their  indiscriminate  use. 

The adjusted moments involving Sheppard's Corrections are
given by the following equations:

$$\begin{aligned}
\mu_2 &= \nu_2 - \frac{w^2}{12} \\
\mu_3 &= \nu_3 \\
\mu_4 &= \nu_4 - \frac{w^2}{2}\,\nu_2 + \frac{7 w^4}{240}
\end{aligned} \qquad (10)$$

(Sheppard's Corrections if moments are expressed in the given unit)

where w is the class interval.

If the moments are expressed in the class unit, w = 1, and the
simplifications are evident.

The  refined  or  adjusted  formulas  for  the  standard  deviation,  the 
skewness,  and  the  excess  are  given  by: 

$$\sigma = \sqrt{\mu_2}, \qquad \alpha_3 = \frac{\mu_3}{\sigma^3}, \qquad \alpha_4 - 3 = \frac{\mu_4}{\sigma^4} - 3$$

No corrections are applied to the moments of theoretical distributions
and curves. In such cases we indicate the nth moment about M by μn and
about any other point by μ'n.
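
The arithmetic of Sheppard's Corrections is brief enough to sketch. The function below (its name is ours) applies the three equations given above to crude moments about the mean expressed in the given unit. Purely to illustrate the arithmetic, it is fed the crude moments of distribution C of Table 27 (ν2 = 36, ν3 = -150, ν4 = 3600, class interval w = 5); whether the corrections are warranted for any particular set of data is a separate question, as the text cautions.

```python
def sheppard_corrections(nu2, nu3, nu4, w=1.0):
    """Adjusted moments (mu2, mu3, mu4) from crude moments about the mean.

    w is the class interval; use w = 1 when the moments are in class units.
    """
    mu2 = nu2 - w ** 2 / 12
    mu3 = nu3                                  # the third moment is unchanged
    mu4 = nu4 - (w ** 2 / 2) * nu2 + 7 * w ** 4 / 240
    return mu2, mu3, mu4


mu2, mu3, mu4 = sheppard_corrections(36, -150, 3600, w=5)
print(round(mu2, 4), mu3, round(mu4, 4))
```

The correction always lowers the second moment, so the adjusted standard deviation is somewhat smaller than the crude one.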

44.   COMPUTATION  OF  THE  MOMENTS 
The  order  of  procedure  when  computing  the  moments  should  be : 

1. Choose a convenient arbitrary origin, and compute ν1', ν2', ν3', ν4'.

2. Transfer the moments to the mean by means of equations (9), and
thus compute ν1, ν2, ν3, ν4. See that the proper units are included.

3. If Sheppard's Corrections are to be applied, use equations (10) and
compute μ1, μ2, μ3, μ4.

We shall illustrate this procedure by computing the moments of
the following distribution:

¹ Rietz and others, op. cit., pp. 92 et seq.
Table 29. Frequency Distribution of Pulse Beats per Minute
in English Convicts¹

   X     f(X)    x'    x'f(x)    x'²f(x)    x'³f(x)    x'⁴f(x)
  46.5      2    -8      -16       128      -1,024      8,192
  50.5      5    -7      -35       245      -1,715     12,005
  54.5     17    -6     -102       612      -3,672     22,032
  58.5     57    -5     -285     1,425      -7,125     35,625
  62.5     90    -4     -360     1,440      -5,760     23,040
  66.5    150    -3     -450     1,350      -4,050     12,150
  70.5    120    -2     -240       480        -960      1,920
  74.5    131    -1     -131       131        -131        131
  78.5    109     0        0         0           0          0
  82.5     86     1       86        86          86         86
  86.5     62     2      124       248         496        992
  90.5     42     3      126       378       1,134      3,402
  94.5     15     4       60       240         960      3,840
  98.5     18     5       90       450       2,250     11,250
 102.5      9     6       54       324       1,944     11,664
 106.5      5     7       35       245       1,715     12,005
 110.5      3     8       24       192       1,536     12,288
 114.5      3     9       27       243       2,187     19,683
 Total    924           -993     8,217     -12,129    190,305

Choosing h = 78.5, we have for the ν''s:

$$\nu_1' = \frac{\sum x' f(x)}{N} = b_x = \frac{-993}{924} = -1.0746753$$

$$\nu_2' = \frac{\sum x'^2 f(x)}{N} = \frac{8217}{924} = 8.8928571$$

$$\nu_3' = \frac{\sum x'^3 f(x)}{N} = \frac{-12129}{924} = -13.12662338$$

$$\nu_4' = \frac{\sum x'^4 f(x)}{N} = \frac{190305}{924} = 205.9577923$$

Using equations (9) we shall now express the ν's in the given unit.
We have, noting that w = 4:

$$\nu_1 = 0$$

$$\nu_2 = 16[8.8928571 - (-1.0746753)^2] = 16(8.8928571 - 1.1549270) = 16(7.73793014)$$

$$\begin{aligned}
\nu_3 &= 64[-13.12662338 - 3(8.8928571)(-1.0746753) + 2(-1.0746753)^3] \\
      &= 64(-13.12662338 + 28.67080162 - 2.48234304) \\
      &= 64(13.06183520)
\end{aligned}$$

$$\begin{aligned}
\nu_4 &= 256[205.9577923 - 4(-13.12662338)(-1.0746753) + 6(8.8928571)(-1.0746753)^2 - 3(-1.0746753)^4] \\
      &= 256(205.9577923 - 56.42753004 + 61.62360463 - 4.0015692) \\
      &= 256(207.1522977)
\end{aligned}$$

Assuming that Sheppard's Corrections may be applied, we find the μ's:

$$\mu_1 = \nu_1 = 0$$

$$\mu_2 = 16(7.73793014) - 16(.08333333) = 16(7.65459681)$$

$$\mu_3 = \nu_3 = 64(13.06183520)$$

$$\mu_4 = 256(207.152298) - \frac{256}{2}(7.73793014) + 256(.029167) = 256(203.312500)$$

¹ The data are taken from Biometrika, Vol. 11.

Hence we have:

the unadjusted constants

$$M = 78.5 + 4(-1.074675) = 74.2 \text{ p.b.}$$

$$\sigma = \sqrt{\nu_2} = 4(2.7817135) = 11.12685 \text{ p.b.}$$

$$\alpha_3 = \frac{\nu_3}{\sigma^3} = \frac{64(13.06183520)}{64(21.52470467)} = 0.6068$$

$$\alpha_4 = \frac{\nu_4}{\sigma^4} = \frac{256(207.1522977)}{256(59.875562)} = 3.4597$$

$$\kappa = \alpha_4 - 3 = 0.4597$$

and the adjusted constants

$$M = 74.2 \text{ p.b.}$$

$$\sigma = \sqrt{\mu_2} = 4(2.7667) = 11.0668 \text{ p.b.}$$

$$\alpha_3 = \frac{\mu_3}{\sigma^3} = 0.6168$$

$$\alpha_4 = \frac{\mu_4}{\sigma^4} = \frac{256(203.312500)}{256(58.592852)} = 3.46992$$

$$\kappa = \alpha_4 - 3 = 0.46992$$

The student will note that the application of Sheppard's Corrections
here has affected the constants slightly.
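
The whole order of procedure for Table 29 can be retraced in a few lines. The following sketch is ours and rests only on what the section states: the assumed origin h = 78.5, the class width w = 4, equations (9) for the transfer to the mean, and equations (10) for Sheppard's Corrections. Variable and function names are hypothetical.

```python
from math import sqrt

# Table 29: class mid-points (pulse beats per minute) and frequencies.
h, w = 78.5, 4                                 # assumed origin and class width
X = [46.5 + w * i for i in range(18)]          # 46.5, 50.5, ..., 114.5
f = [2, 5, 17, 57, 90, 150, 120, 131, 109, 86, 62, 42, 15, 18, 9, 5, 3, 3]
N = sum(f)

# Step 1: moments about the assumed origin, in class units.
xp = [round((x - h) / w) for x in X]           # x' = -8, -7, ..., 9
n1p, n2p, n3p, n4p = (sum(fi * x ** k for x, fi in zip(xp, f)) / N
                      for k in (1, 2, 3, 4))
b = n1p

# Step 2: transfer to the mean by equations (9), in given units.
nu2 = w ** 2 * (n2p - b ** 2)
nu3 = w ** 3 * (n3p - 3 * n2p * b + 2 * b ** 3)
nu4 = w ** 4 * (n4p - 4 * n3p * b + 6 * n2p * b ** 2 - 3 * b ** 4)

# Step 3: Sheppard's Corrections, equations (10).
mu2 = nu2 - w ** 2 / 12
mu3 = nu3
mu4 = nu4 - (w ** 2 / 2) * nu2 + 7 * w ** 4 / 240


def constants(m2, m3, m4):
    """Return (sigma, alpha3, alpha4) from the 2nd, 3rd and 4th moments."""
    sigma = sqrt(m2)
    return sigma, m3 / sigma ** 3, m4 / m2 ** 2


M = h + w * b
unadjusted = constants(nu2, nu3, nu4)
adjusted = constants(mu2, mu3, mu4)
print(f"M = {M:.4f}")
print("unadjusted sigma, alpha3, alpha4:", [round(v, 4) for v in unadjusted])
print("adjusted   sigma, alpha3, alpha4:", [round(v, 4) for v in adjusted])
```

Since all sums are formed from small integers in class units, the only decimals that enter are the four quotients by N, which is the economy the short method was designed to secure.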

Assuming that the parent population is normal, the values of α3,u
and α4,u of the universe are usually written:

$$\alpha_{3,u} = (\text{the computed } \alpha_3) \pm 0.6745\sqrt{\frac{6}{N}}$$

$$\alpha_{4,u} = (\text{the computed } \alpha_4) \pm 0.6745\sqrt{\frac{24}{N}}$$

These statements mean that the chances are even, or it is equally
likely, that the computed values of α3 and α4 do not differ numerically
by more than the specified amounts from the true values, α3,u and α4,u.

For the illustrative problem we are considering we have:

M = 74.2 ± 0.2456
σ = 11.0668 ± 0.1736
α3 = 0.6168 ± 0.0544
α4 = 3.46992 ± 0.1088

(See Section 37, p. 142.)
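
The probable errors just quoted can be reproduced as follows. The expressions for α3 and α4 are those given above; the expressions for M and σ are not restated in this section, and the sketch assumes the usual forms 0.6745σ/√N and 0.6745σ/√(2N), which do reproduce the figures 0.2456 and 0.1736. The function name is ours.

```python
from math import sqrt


def probable_errors(n, sigma):
    """Probable errors of M, sigma, alpha3 and alpha4 for a normal parent population.

    The forms for M and sigma are assumed here (see the lead-in); those for
    alpha3 and alpha4 follow the expressions stated in the text.
    """
    return {
        "M": 0.6745 * sigma / sqrt(n),
        "sigma": 0.6745 * sigma / sqrt(2 * n),
        "alpha3": 0.6745 * sqrt(6 / n),
        "alpha4": 0.6745 * sqrt(24 / n),
    }


pe = probable_errors(924, 11.0668)   # N and adjusted sigma of Table 29
for name, value in pe.items():
    print(f"{name}: {value:.4f}")
```

Note that the probable error of α4 is exactly twice that of α3, since √24 = 2√6.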


EXERCISES 

1. Using Bowley's coefficient, formula (3), find the skewness for the
distribution of grades in college algebra as given in Table 8 (p. 26).

2. Using Pearson's formula (2), find the skewness for the distributions
of heights and weights described in Exercise 1, page 54.

3. Using Bowley's coefficient, find the skewness for the distributions
of heights and weights described in Exercise 1, page 54.

4. Find σ, α3, and α4 for the distributions of Exercise 4, page 102.

5. Continue the analysis of Exercise 6 on page 147 by finding α3 and α4
for the distributions described in Exercise 15, page 105.

6. If the class interval is taken as a unit, i.e., if w = 1, show that:

$$\nu_2 = \nu_2' - b_x^2,$$

$$\nu_3 = \nu_3' - 3\nu_2' b_x + 2 b_x^3,$$

$$\nu_4 = \nu_4' - 4\nu_3' b_x + 6\nu_2' b_x^2 - 3 b_x^4$$

7. Compute M, σ, α3, and α4 for the distribution in Exercise 2, page 54.

8. Compute α3 and α4 for the data of Exercise 19 at the end of this
chapter.

9. Compute α3 and α4 for the data of Exercise 24 at the end of this
chapter.

10. Compute M, Md, Mo, σ, α3, and α4 for this table of chest
measurements.

The Chest Measurements of 10,000 Men
(Original measurements to the nearest inch)¹

  X      f(X)
 33         6
 34        35
 35       125
 36       338
 37       740
 38     1,303
 39     1,810
 40     1,940
 41     1,640
 42     1,120
 43       600
 44       222
 45        84
 46        30
 47         5
 48         2
Total  10,000

11. Compute M, Md, Mo, σ, α3, and α4 for this table of heights.

Distribution of Heights, 6,441 Colored Soldiers
(Original measurements to the nearest centimeter)²

   X      f(X)
 148.5       2
 150.5       9
 152.5      13
 154.5      23
 156.5      56
 158.5      88
 160.5     162
 162.5     318
 164.5     468
 166.5     564
 168.5     665
 170.5     708
 172.5     749
 174.5     747
 176.5     586
 178.5     469
 180.5     314
 182.5     207
 184.5     133
 186.5      70
 188.5      38
 190.5      22
 192.5      15
 194.5      10
 196.5       3
 198.5       2
 Total   6,441

¹ The data are taken from E. T. Whittaker and George Robinson, The
Calculus of Observations, 1924, p. 189.

² The data are taken from Annual Report of the Surgeon General, Medical
Department of the United States Army, Vol. XV, Pt. I, p. 522.
45.   RETROSPECT AND PROSPECT

We have now come to the end of our first important statistical
problem, the elementary analysis of a simple frequency distribution.
This analysis has been accomplished by computing certain statistical
constants and making simple and concise statements about them. If
a distribution is fairly symmetrical, the arithmetic mean and the
standard deviation are usually sufficient to give a numerical
description. If it is skew, then a coefficient of skewness is included.
If further refinement is desired, a coefficient of kurtosis is computed.
Each computed parameter adds to our information about the
distribution in question.

To proceed further into the analysis of a frequency distribution
would take us into the study of frequency curves which, as we have
previously stated, is beyond the scope of this text. While in a later
chapter we do consider the normal frequency curve (see Chapter 12),
the study of skew frequency curves would take us too far afield.
This is a topic to which the student trained in the calculus and
elementary statistical analysis may look forward.

The second problem that we shall consider is the important
problem of correlation. However, before we approach it, we shall
deviate somewhat from our course and give a brief consideration to
the application of averages to Index Numbers.


MISCELLANEOUS   QUESTIONS  FOR  REVIEW 

1. Find the sums:

(1) $\sum_{X=1}^{100} (2X + 5)$     (2) $\sum_{X=10}^{50} (2X^2 - 3X + 6)$

2. What is meant by the statistical analysis of a group of data?

3. What are the purposes of a graphical presentation of a set of
statistical data?

4. What is a histogram? A frequency polygon? Give directions for
constructing each.

5. What is meant by: "The central tendency of a distribution"?
"The dispersion of a distribution"? "The skewness of a distribution"?

6. Define three measures of central tendency; three measures of
dispersion; three measures of skewness.

7. From the formula defining the arithmetic mean, derive another
formula for M.
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8.  If  a  =  127  =b  0.2  and  6  =  2.2  ±  0.3,  find  the  extreme  values  of 


9.  In  what  type  of  distributions  are  Qi  and  Qs  equally  distant  from  Md*l 

10.  From  the  formula  defining  cr,  derive  two  other  formulas  for  com- 
puting a. 

11.  To  compute  (t,  is  it  necessary  to  compute  M? 

12.  A  measurement  of  the  length  of  a  room  is  recorded  22.6  db  0.05  feet. 
What  does  this  statement  mean? 

13.  The  arithmetic  mean  of  a  sample  distribution  of  100  grades  is 
written  70  dt  0.6.  What  does  this  statement  mean?  What  is  the  standard 
deviation  of  the  sample? 

14.  The  standard  deviation  of  a  sample  distribution  of  100  grades  is 
written  8  ±  0.38.  What  does  this  statement  mean?  What  is  the  standard 
deviation  of  the  sample? 

16.  Why  is  the  standard  deviation  a  good  measure  of  dispersion? 

16.  Criticize  the  following  statements: 

(1)  The  range  is  the  most  perfect  measure  of  variability  because  it 
includes  all  the  measurements. 

(2)  In  constructing  a  frequency  distribution  the  selection  of  the  class 
interval  is  arbitrary. 

(3)  If  the  probat>le  error  of  the  mean  is  attached  to  the  computed  mean, 
the  true  mean  is  then  exactly  known. 

(4)  If  the  sum  of  the  frequencies  is  equal  to  the  count  of  the  original 
scores,  the  tabulation  is  correct. 

(5)  A  score  recorded  as  80  means  that  the  measure  extends  from  80  to  81. 

(6)  If  a  class  is  designated  ''85-89,^'  the  correct  midpoint  would  always 
be  87.5. 

(7)  M  zt  a  establishes  an  interval  that  always  includes  about  (§)A^. 

17. The analysis of an approximately normal distribution of the weekly
salaries of 600 men gave M = $30 and σ = $5.

(1) About how many received salaries between $25 and $35?

(2) Assuming that Range = 6σ, about what was the maximum salary?
The minimum salary?

18.  The  heights  and  weights  of  1,515  men  gave  two  approximately 
normal  distributions  with  the  following  statistical  constants: 


(1) a + b


(2) a - b


(3) a · b


        Heights                    Weights

   N  = 1515                  N  = 1515
   M  = 67.92 inches          M  = 138.88 pounds
   Md = 68.02 inches          Md = 137.62 pounds
   σ  =  2.43 inches          σ  =  17.2 pounds


(1)  Which  distribution  shows  the  greater  dispersion?  Why? 

(2)  Which  distribution  shows  the  greater  skewness?  Why? 
19. The accompanying distribution gives the percentage fat content of
milk as shown by 8,037 milking records. The data were taken from
Bulletin 245 of the University of Illinois Agricultural Experiment
Station, p. 603.

Compute: M, Md, Mo by fitting a parabola, Q1, Q3, σ, and Sk.

Find E_M and interpret it. Find E_σ and interpret it.

      X       f(X)

    3.85         3
    4.05        41
    4.25       127
    4.45       303
    4.65       524
    4.85       852
    5.05      1033
    5.25      1106
    5.45      1137
    5.65       983
    5.85       799
    6.05       532
    6.25       281
    6.45       177
    6.65        80
    6.85        37
    7.05        16
    7.25         3
    7.45         3

    Total     8037


20. The data in the accompanying table give the head circumference
(centimeters) of 1,071 boys. The data were taken from: "The Evaluation
of Anthropometric Data," by Winfield S. Hall, Journal of American
Medical Association, Vol. 37, p. 1646.

Find M, σ, α3 and α4 for this distribution.

      X       f(X)

     51          4
     52         23
     53         59
     54        108
     55        224
     56        257
     57        230
     58        110
     59         38
     60         16
     61          2

    Total     1071

21. What are two points of view that may be adopted with regard to
the statistical analysis of a set of data?

22. Does σ meet Yule's requirements of a good average?

23. A class was given two tests with the following results: M1 = 76,
σ1 = 11; M2 = 59, σ2 = 14. A student made 92 on the first test and
82 on the second test. On which test did he do better?

24. The following distribution presenting the life experience of wooden
telephone poles was adopted from Robley Winfrey and Edwin B. Kurtz:
Life Characteristics of Physical Property, Bulletin 103, Iowa Engineering
Experiment Station, p. 57. Compute M, σ, E_M and E_σ.


Life in Years   Number of Poles     Life in Years   Number of Poles
                   Replaced                            Replaced
      X              f(X)                 X              f(X)

      1                4                 12               95
      2                7                 13               91
      3               15                 14               73
      4               32                 15               64
      5          [illegible]             16          [illegible]
      6               57                 17               30
      7               61                 18               18
      8               73                 19                5
      9               96                 20                1
     10              104                 21                1
     11              103                 22                2

                                        Total           1000

25.  Criticize  the  following  statements: 

(1)  The  number  2.340  has  four  significant  figures. 

(2)  The  relative  error  in  a  measurement  is  the  ratio  of  the  absolute 
error  to  the  true  value  of  the  quantity  measured. 

(3)  The  population  of  a  city  was  recorded  as  300,000  ±  3,000.  The 
percentage  error  was  3  per  cent. 

(4)  The  length  of  a  line  was  measured  twenty  times.  The  arithmetic 
mean  of  the  measurements  gives  the  true  length. 

(5)  In  our  notation  X  indicates  class  frequency. 

(6) The guessed mean, h, should be chosen at the midpoint of a class
interval.

(7)  Ordinarily  the  number  of  class  intervals  should  be  more  than  ten 
and  less  than  thirty. 

(8)  It  would  be  possible  for  three  people  to  get  three  different  frequency 
distributions  from  the  same  data  and  all  be  right. 

(9)  If  the  sum  of  the  frequencies  agrees  with  the  count  of  the  original 
measurements,  the  tabulation  of  the  frequency  distribution  is 
correct. 

(10)  The  quartile  points  are  used  to  measure  both  dispersion  and  skew- 
ness. 

(11)  No  matter  what  value  of  h  is  chosen,  the  same  result  will  be  ob- 
tained for  a  if  the  computation  is  correct. 

(12) In symmetrical distributions the first and third quartile points are
equidistant from Md.

(13) The standard deviation is a point, not a distance.

(14) The range of a mound-shaped distribution equals 6σ approximately.

(15)  The  probable  error  of  the  mean  shows  what  mistake  was  probably 
made  in  computing  M. 

(16)  The  statement  M  =  75  ±3  means  that  the  true  value  of  M  lies 
between  72  and  78. 

(17) When M is greater than Md, the skewness is positive. The skewness
is also positive if q2 is greater than q1.

(18)  If  the  probable  error  is  attached  to  a  statistical  constant,  the 
results  are  then  exact. 

(19) A distance of 3σ laid off on both sides of M establishes an interval
that includes about 99 per cent of the total frequency of a
mound-shaped distribution.

(20)  For  a  manufacturer  of  hats,  the  mode  is  a  more  important  measure 
of  central  tendency  than  the  arithmetic  mean. 

(21)  Mh  of  a  group  of  numbers  is  the  reciprocal  of  M  of  the  group. 

(22) Mh of the numbers 2, 3, and 6 is greater than their M.

26.  The  data  of  the  following  tables  are  taken  from  Bulletin  No.  623 
of  the  U.S.  Department  of  Labor,  "Wages,  Hours,  and  Working  Conditions 
in  the  Bread-Baking  Industry,  1934."  They  present  the  hourly  earnings 
in  December,  1934  of  employees  distributed  as  to  sex. 

Compute M, Md, σ, and Sk for each distribution.


      Class
     (cents)              Males     Females     Males and Females

    0   a.u.  12.5            1           0               1
   12.5 a.u.  17.5            6           1               7
   17.5 a.u.  22.5           14           3              17
   22.5 a.u.  27.5          148         165             313
   27.5 a.u.  32.5          509         635            1144
   32.5 a.u.  37.5         1517         746            2263
   37.5 a.u.  42.5         2615         545            3160
   42.5 a.u.  47.5         2325         205            2530
   47.5 a.u.  52.5         1853         138            1991
   52.5 a.u.  57.5         1698          69            1767
   57.5 a.u.  62.5         1387          34            1421
   62.5 a.u.  67.5         1418          32            1450
   67.5 a.u.  72.5         1169          10            1179
   72.5 a.u.  77.5         1052           9            1061
   77.5 a.u.  85.0          876           6             882
   85.0 a.u. 100           1148          13            1161
  100   a.u. 120            465           3             468
  120   a.u. 150            147           0             147

   Total                  18348        2614           20962


Chapter  6 


INDEX NUMBERS ¹

46.  INTRODUCTION 

In the preceding chapters we have devoted no little attention to
variation as a characteristic of statistical phenomena. In characterizing
a frequency distribution, we devoted an entire chapter to the
measurement of dispersion, a measurement of the extent to which the
individual items vary on the average from the arithmetic mean. From one
point of view, simple correlation is a study of the variation that occurs
on the average in one variable when a linearly related variable changes
by a given amount. In the study of the normal curve, we must have
been impressed with the fact that the equation defining this curve
describes a very particular kind of variation of a group of measurements
from their arithmetic mean. Our formulas for estimating reliability
are efforts to define a range of variation about a statistical constant
within which fluctuations, due to pure chance, may be expected to
occur according to definite probabilities. Each of these important
statistical concepts emphasizes, therefore, a particular kind of
variation. Speaking rather broadly, we may say that statistical analysis
is largely a study of variation in statistical phenomena.

In  this  chapter  we  shall  still  be  concerned  with  variation  as  a  char- 
acteristic of  our  data,  but  we  shall  regard  the  variation  in  a  different 
manner  than  we  have  done  previously.  Stated  in  rather  general 
terms,  our  present  objective  is  the  reduction  of  series  of  data,  more 
or  less  complex,  to  numbers  purely  relative  which  will  facilitate  com- 
parison. Thus,  we  shall  be  interested  primarily  in  measuring  relative 
variations  in  the  magnitudes  of  statistical  groups.  The  statistical  de- 
vices by  which  we  do  this  are  called  index  numbers. 

47.  RELATIVES 

In their simplest forms, index numbers are ratios, generally
expressed as percentages, of one quantity to another quantity of the
same kind called the base. Index numbers have been most widely
employed in the study of price changes, but they also may be
employed in the study of variation in unemployment, in production, in
building, in manufacturing; in short, wherever group movements
are to be measured.

¹ This chapter may be omitted without destroying the continuity.


Table 30. Production of Motor Vehicles in the
United States, 1920-1929 *

              Number          Relatives        Link
  Year     (in thousands)      to 1920       Relatives
   (1)          (2)              (3)            (4)

  1920         2227              100
  1921         1682               76             76
  1922         2646              119            157
  1923         4180              188            158
  1924         3738              168             89
  1925         4428              199            118
  1926         4506              202            102
  1927         3580              161             79
  1928         4601              207            129
  1929         5622              252            122

Consider the data of Table 30. Column (2) gives the total
production (aggregates) of motor vehicles produced in the United States
in the years 1920-1929. It is readily observed from column (3)
that a comparison of the values for different dates with the value at
some fixed base, or a study of the variation in production relative to
some fixed base, is greatly facilitated by reducing the several
aggregates to a series of percentages (relatives). If the production in
1920 is taken as the base production and is represented by 100, the
production relative for any other year merely expresses the production
of that year as a percentage of the production for the base year.
That is,

$$\text{Relative for a given year} = \frac{\text{Production for given year}}{\text{Production for base year}} \times 100$$

Thus, each item in column (3) is the ratio of the corresponding item
in column (2) to the 1920 production, expressed as a percentage.
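
As a minimal sketch of the computation just described, the following Python fragment reproduces column (3) of Table 30 from column (2). The function name `fixed_base_relatives` and the dictionary layout are illustrative assumptions, not notation from the text.

```python
# Motor-vehicle production in thousands, Table 30, columns (1) and (2).
production = {
    1920: 2227, 1921: 1682, 1922: 2646, 1923: 4180, 1924: 3738,
    1925: 4428, 1926: 4506, 1927: 3580, 1928: 4601, 1929: 5622,
}


def fixed_base_relatives(series, base_year):
    """Express every value as a percentage of the base-year value."""
    base = series[base_year]
    return {year: 100 * value / base for year, value in series.items()}


relatives = fixed_base_relatives(production, base_year=1920)
for year, rel in relatives.items():
    # Rounded to whole per cents, as in column (3).
    print(year, round(rel))
```

Choosing another base year only changes the `base_year` argument; the relation of any two relatives to each other is unaffected.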


*  The  data  are  taken  from  Statistical  Abstract  of  the  United  States,  1930,  p.  385. 
If it is desired to compare the values for each year with those of
the preceding year, a link relative may be employed. The link relative
for any year is constructed by dividing the value in that year by the
value in the preceding year, and expressing the result as a percentage.
That is,

$$\text{Link relative for a given year} = \frac{\text{Value for given year}}{\text{Value for preceding year}} \times 100$$

Thus, in Table 30, the link relative for 1922 is
$\frac{2646}{1682} \times 100 = 157$.
In order to distinguish them, the relatives shown in column (3) are
called fixed-base relatives.

The link relatives thus establish a chain of relatives, each year being
tied to the preceding year, and from the link relatives we may obtain
a further set of relatives called chain relatives. We assign 100 as the
chain relative for the first year and define the chain relative for any
other year to be the product of the link relative for that year and the
chain relative for the preceding year, the product to be divided by
100. It should be evident from the definitions that, when a single
commodity is involved, the chain relatives are equal to the
fixed-base relatives.
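
The definitions of link and chain relatives can be sketched in Python as follows, again using the production figures of Table 30. The function names are illustrative assumptions. The link relatives are kept unrounded here; if they were first rounded to whole per cents, the chain relatives would drift slightly from the fixed-base relatives.

```python
# Motor-vehicle production in thousands for 1920 through 1929 (Table 30).
production = [2227, 1682, 2646, 4180, 3738, 4428, 4506, 3580, 4601, 5622]


def link_relatives(values):
    """Each value as a percentage of the preceding value (none for the first)."""
    return [100 * cur / prev for prev, cur in zip(values, values[1:])]


def chain_relatives(links):
    """Start at 100, then multiply by each link relative and divide by 100."""
    chain = [100.0]
    for link in links:
        chain.append(chain[-1] * link / 100)
    return chain


links = link_relatives(production)   # links[0] belongs to 1921
chain = chain_relatives(links)       # chain[0] belongs to 1920
fixed = [100 * v / production[0] for v in production]

print([round(x) for x in links])
print([round(x) for x in chain])
```

For this single series the chain relatives agree with the fixed-base relatives up to floating-point rounding, which is the equality the text asserts.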

Simple relatives may be employed to compare the fluctuations in
two or more variables, and to permit the computation of an average
price relative. To facilitate the comparison of the fluctuations in the
prices of corn and hogs in the United States for the decade
1920-1929 (see Table 31), we have computed their fixed-base relatives,
shown in columns (4) and (5), with the prices in 1920 as the base
prices. It can now be seen at a glance how one set of relative prices
changes as compared with the other. To explain the behavior of
the fluctuations recorded in the table would require other data that
are not included here. The numbers in column (6), which are the
arithmetic means of the numbers in columns (4) and (5), give the
average price relatives based upon the two given commodities. Thus,
the general average price of these two commodities was 3 per cent
higher in 1924 than in 1920, and was 19 per cent lower in 1923 than
in 1920.
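
A short sketch of how an entry of column (6) arises, using only the 1920, 1921 and 1923 rows of Table 31. The function name is an illustrative assumption, and the relatives are rounded to whole per cents before averaging, which is an assumption about the order of rounding rather than something the text states.

```python
# Prices from Table 31 for three of the ten years.
prices = {
    "corn": {1920: 67.2, 1921: 42.3, 1923: 72.6},    # cents per bushel
    "hogs": {1920: 13.91, 1921: 8.51, 1923: 7.55},   # dollars per 100 pounds
}


def average_price_relative(prices, year, base_year):
    """Arithmetic mean of the rounded fixed-base relatives of all commodities."""
    rels = [round(100 * series[year] / series[base_year])
            for series in prices.values()]
    return sum(rels) / len(rels)


avg_1921 = average_price_relative(prices, 1921, 1920)
avg_1923 = average_price_relative(prices, 1923, 1920)
print(avg_1921, avg_1923)
```

The 1923 figure of 81 corresponds to the statement that the average price was 19 per cent lower in 1923 than in 1920.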

The price relatives in Table 31 have been based upon the prices
of 1920. Of course the prices for any other year could have been
chosen as the bases. The averages of the decade prices, 70.3 cents
per bushel for corn and 10.08 dollars per hundred pounds for hogs,
would have been more satisfactory bases since they are
representative and are less affected by chance variations.


Table 31. Prices of Corn and Hogs in the United States
for the Years 1920-1929, and Their Relatives *

             Corn           Hogs         Relatives (1920 = 100)     Average
  Year    (cents per    (dollars per     Price of      Price of      price
            bushel)      100 pounds)       corn          hogs       relative
   (1)        (2)            (3)           (4)           (5)          (6)

  1920       67.2          13.91           100           100          100
  1921       42.3           8.51            63            61           62
  1922    [illegible]    [illegible]        98            66           82
  1923       72.6           7.55           108            54           81
  1924       98.2           8.11           147            58          103
  1925       67.4          11.81           101            85           93
  1926       64.2          12.34            96            89           93
  1927       72.3           9.95           108            72           90
  1928       75.2           9.22           112            66           89
  1929       78.1          10.16           117            73           95

  Totals    703.1         100.78          1050           724          888
  Means      70.3          10.08           105            72           89


EXERCISES 

1. Compute the fixed-base relatives (1909 = 100) for the data of Table
11, page 45.

2. Compute the fixed-base and the link relatives (1909-1910 = 100)
for the data of Exercise 18, page 106.

3. With the arithmetic mean of the production as base, compute the
fixed-base relatives for the data of Exercise 12, page 57.

4.  Using  the  arithmetic  means  of  columns  (2)  and  (3)  as  bases,  compute 
the  average  price  relatives  for  the  data  of  Table  31. 

48.   DEFINITIONS  AND  NOTATION 

We have defined index numbers to be devices which summarize
the relative fluctuations in a group of variables. Inasmuch as the
essential purpose of an index number is to measure the variation in a
group of variables, it is probably better practice to employ the terms
"relative numbers" and "relatives" when referring to single series
in terms of a fixed base, and to reserve the term "index number" to
describe the variation in a group of variables in combination. The
numbers in column (3) of Table 30 may properly be called
"relatives" whereas those of column (6) of Table 31 may properly be
called "index numbers." While index numbers are sometimes
expressed as mere aggregates, yet more generally they are expressed as
percentages of the values in an arbitrarily chosen base period.*

* Table 31: The data are taken from Statistical Abstract of the United
States, 1930, p. 682 and p. 661.

Many  methods  may  be  employed  in  the  construction  of  index 
numbers,  and  there  are  differences  of  opinion  as  to  which  is  the  best 
method.  In  our  treatment,  we  shall  devote  the  emphasis  to  the  best 
known  methods  of  construction  and  attempt  to  avoid  controversial 
questions.  We  shall  make  use  of  the  following  symbols: 

$p_0'$ = price of the first commodity at time "0" (the base period)
$p_i'$ = price of the first commodity at time "i"
$p_i^{(n)}$ = price of the nth commodity at time "i"
$q_0'$ = quantity of the first commodity at time "0"
$q_i'$ = quantity of the first commodity at time "i"
$q_i^{(n)}$ = quantity of the nth commodity at time "i"

$\frac{p_i}{p_0}$ = a price relative (ratio of price of a given commodity at time
"i" to the price of the same commodity at time "0,"
expressed as a percentage)

$\frac{q_i}{q_0}$ = a quantity relative


In the construction of unweighted (or simple) index numbers, the
individual members of the group are all regarded as of equal
importance. The influence of no member of the group is to be weighted
by multiplying the member by some quantity or weight. If some
members of the group are to be considered as more important than
others, we shall apply to the important members weights that are
expected to reflect their relative importance. Unweighted indices
will be considered in this section; weighted, in the next.

* In this book we shall assume that relatives and index numbers are
expressed as percentages.

A. Simple Aggregative Relatives. An aggregative index number
is based upon the sums (aggregates) of the items for the several years.
The aggregative relative is found by comparing the results thus secured
for different dates. If prices are in question, the aggregative relative
is given by

$${}_0P_i = \frac{\Sigma p_i}{\Sigma p_0} \qquad (1)$$

To illustrate the method of computing aggregative relatives, let us
consider the data of Table 32, which gives the farm prices in cents
per bushel of seven important grains. We find the aggregates $\Sigma p_i$
of the prices for each of the several years. Choosing 1921 as the
base year (where i = 0), we find the aggregative relatives
$\Sigma p_i / \Sigma p_0$ for the other years and express our results as
percentages. We note that the aggregative relative for 1925 is 138.
This may be interpreted to mean that the farm prices of these grains
for 1925 were, on the average, 38 per cent higher than for 1921.


Table 32. Farm Prices in Cents per Bushel of Grains
in the United States *

Computing the aggregative relatives

  Grain                       1921     1923     1925     1927     1929

  Corn                        42.3     72.6     67.4     72.3     78.1
  Wheat                       92.6     92.3    141.6    111.5    104.3
  Oats                        30.2     41.4     38.0     45.0     43.5
  Rye                         69.7     65.0     78.2     85.3     87.1
  Barley                      41.9     54.1     58.8     67.8     55.0
  Buckwheat                   81.2     93.3     88.8     83.5     97.7
  Rice                        95.2    110.2    153.8     92.9     97.8

  Σp_i                       453.1    528.9    626.6    558.3    563.5

  0P_i = Σp_i / Σp_0           100      117      138      123      124

* The data are taken from Statistical Abstract of the United States, 1930,
pp. 682-683.
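
The aggregates and aggregative relatives of Table 32 can be checked with a few lines of Python. This is only a sketch: the function name `aggregative_relative` is an illustrative assumption, and the prices are entered in the order corn, wheat, oats, rye, barley, buckwheat, rice.

```python
# Farm prices in cents per bushel, Table 32, one list per year.
prices = {
    1921: [42.3, 92.6, 30.2, 69.7, 41.9, 81.2, 95.2],
    1923: [72.6, 92.3, 41.4, 65.0, 54.1, 93.3, 110.2],
    1925: [67.4, 141.6, 38.0, 78.2, 58.8, 88.8, 153.8],
    1927: [72.3, 111.5, 45.0, 85.3, 67.8, 83.5, 92.9],
    1929: [78.1, 104.3, 43.5, 87.1, 55.0, 97.7, 97.8],
}


def aggregative_relative(prices, year, base_year):
    """Formula (1): sum of given-year prices over sum of base-year prices, in per cent."""
    return 100 * sum(prices[year]) / sum(prices[base_year])


index = {year: round(aggregative_relative(prices, year, 1921)) for year in prices}
print(index)
```

Because the prices are simply added, a grain with a high price per bushel moves the index more than a cheap one, which is the concealed weighting discussed in this section.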

It is evident that the computation of the aggregative relative
requires that all items be reduced to the same unit, otherwise we would
be combining non-homogeneous things and the sums would have no
meaning.* To illustrate, consider the following prices of several
commodities in 1925:

Anthracite coal    $5.30    per ton (2000 pounds)
Cotton              0.182   per pound
Potatoes            2.10    per bag (100 pounds)
Wheat               1.60    per bushel (60 pounds)

We may reduce these prices to the same unit, and quote them as
follows:

Anthracite coal    $0.265   per 100 pounds
Cotton             18.20    per 100 pounds
Potatoes            2.10    per 100 pounds
Wheat               2.67    per 100 pounds

The well-known Bradstreet's index is based upon the simple
aggregative method, the items being reduced to prices per pound. The
aggregates $\Sigma p_i$, themselves, are the indexes; however, they may be
converted into a series of percentages upon any chosen base. It
should be noted that the conversion of all prices into prices per pound
effects a concealed weighting for which there is no logical basis.
Thus, in 1925 in an aggregate of per pound prices, a pound of cotton
was worth 9 times as much as a pound of potatoes and 69 times as
much as a pound of coal. This illogical emphasis given to high-priced
articles is somewhat neutralized in Bradstreet's index by the
introduction of a logical element in that more than one quotation is given
for some of the more important commodities and only one for the
less important articles.
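
The reduction to a common unit, and the concealed weighting it produces, can be illustrated with the four 1925 quotations given above. The dictionary layout and variable names are illustrative assumptions.

```python
# 1925 quotations: (price in dollars, pounds contained in the quoted unit).
quotes = {
    "anthracite coal": (5.30, 2000),   # per ton
    "cotton": (0.182, 1),              # per pound
    "potatoes": (2.10, 100),           # per bag
    "wheat": (1.60, 60),               # per bushel
}

# Reduce every quotation to dollars per 100 pounds.
per_100_lb = {name: 100 * price / pounds
              for name, (price, pounds) in quotes.items()}

for name, price in per_100_lb.items():
    print(f"{name:16s} {price:6.3f} per 100 pounds")

# Relative influence of cotton in an aggregate of per-pound prices.
cotton_vs_potatoes = per_100_lb["cotton"] / per_100_lb["potatoes"]
cotton_vs_coal = per_100_lb["cotton"] / per_100_lb["anthracite coal"]
```

The two ratios round to 9 and 69, the figures quoted in the text for the emphasis that cotton receives over potatoes and coal.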

* It should not be assumed that an aggregative relative based upon such a
reduction will necessarily present a logical index.

B. Simple Average of Relatives. Another method of constructing
index numbers is that of finding some simple average of the relatives
for the given items, the relative for a given commodity at a given
time being referred to the same commodity at a certain basic date.
We may use the arithmetic mean, the geometric mean, the median,
the mode, and the harmonic mean of the relatives. Assuming that a
table of actual amounts has been prepared (such, for example, as
the prices of Table 32), the steps involved in the process are:

1. Reduce each item (price, quantity, value, et cetera) in the time
for which the index is desired to a percentage (relative) of the
item for the same commodity in the base period. That is, if prices
are in question, find $p_i/p_0$ for each commodity; if quantities are in
question, find $q_i/q_0$ for each commodity; if values are in question,
find $v_i/v_0$, and express all the relatives as percentages.

2. Compute the averages of the relatives found.

The arithmetic mean of the price relatives at time "i" is given by

$${}_0P_i = \frac{1}{N}\,\Sigma\,\frac{p_i}{p_0} \qquad (2)$$

where N is the number of prices.

The geometric mean of the N price relatives at time "i" is given by

$${}_0P_i = \sqrt[N]{\frac{p_i'}{p_0'} \times \frac{p_i''}{p_0''} \times \cdots \times \frac{p_i^{(N)}}{p_0^{(N)}}} = \sqrt[N]{\Pi\,\frac{p_i}{p_0}} \qquad (3)$$

where $\Pi$ means "the product of such terms as," and is computed
with the aid of logarithms.

The median of the relatives at time "i" is, of course, found by
arranging the relatives at time "i" in the order of their magnitude.
If N is odd, the middle term is the median. If N is even, we define
the median to be one half the sum of the two middle terms.

The harmonic mean of the relatives at time "i" is given by the
formula

$${}_0P_i = \frac{N}{\frac{p_0'}{p_i'} + \frac{p_0''}{p_i''} + \cdots + \frac{p_0^{(N)}}{p_i^{(N)}}} = \frac{N}{\Sigma\,\frac{p_0}{p_i}} \qquad (4)$$

In column (6) of Table 31 we have shown the arithmetic means
of the relatives for the prices of two commodities, corn and hogs, for
the years 1921 to 1929 with the year 1920 as the base. When several
commodities are being investigated, it is better to arrange the table
with the list of commodities in the stub and the "times" in the box
heads as was done in Table 32.

As an illustrated problem, consider Table 33 which gives, in the
body of the table, the relatives of the farm prices of seven important
grains in the United States. These data were derived from Table 32
by methods previously explained. Thus, the

price relative for corn in 1923 = $\frac{72.6}{42.3} \times 100 = 172$

price relative for buckwheat in 1929 = $\frac{97.7}{81.2} \times 100 = 120$


Table 33. Price Relatives of Grains in the United States,
Based upon Table 32
(1921 = 100)

Computing simple averages of relatives

  Grain                       1921     1923     1925     1927     1929

  Corn                         100      172      159      171      187
  Wheat                        100      100      153      120      113
  Oats                         100      137      126      149      144
  Rye                          100       93      112      122      125
  Barley                       100      129      140      162      131
  Buckwheat                    100      115      109      103      120
  Rice                         100      116      162       98      103

  Totals                       700      862      961      925      923

  Arithmetic Mean of
  relatives                    100      123      137      132      132

  Median of relatives          100      116      140      122      125

  Geometric Mean of
  relatives                    100      121      136      130      130

We continue this process until the table of relatives is complete.
We then compute the averages for the several years. For example,
the geometric mean of the relatives for 1923 is given by

$${}_0P_i = \sqrt[7]{172 \cdot 100 \cdot 137 \cdot 93 \cdot 129 \cdot 115 \cdot 116}$$

or by

$$\log {}_0P_i = \tfrac{1}{7}\left[\log 172 + \log 100 + \log 137 + \log 93 + \log 129 + \log 115 + \log 116\right]$$

  Numbers      Logarithms

    172         2.23 553
    100         2.00 000
    137         2.13 672
     93         1.96 848
    129         2.11 059
    115         2.06 070
    116         2.06 446

    Sum        14.57 648

  log 0P_i = 14.57 648 / 7 = 2.08 235
      0P_i = 120.9
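
As a cross-check on the 1923 column of Table 33, the four averages of formulas (2) to (4) and the median can be computed with Python's standard `statistics` module (the `geometric_mean` function assumes Python 3.8 or later). This is a sketch, not the logarithmic procedure of the text; the variable names are illustrative.

```python
import statistics

# Price relatives for 1923 (1921 = 100), from Table 33.
relatives_1923 = [172, 100, 137, 93, 129, 115, 116]

arithmetic = statistics.mean(relatives_1923)           # formula (2)
geometric = statistics.geometric_mean(relatives_1923)  # formula (3)
median = statistics.median(relatives_1923)             # middle of the ordered relatives
harmonic = statistics.harmonic_mean(relatives_1923)    # formula (4)

print(f"arithmetic mean {arithmetic:.1f}")
print(f"geometric mean  {geometric:.1f}")
print(f"median          {median}")
print(f"harmonic mean   {harmonic:.1f}")
```

For any set of positive relatives that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean; the printed values show that ordering.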


EXERCISES 

1.  Find  the  aggregative  relatives  for  the  production  data  given  in 
Table  34  using  1921  as  the  base  year. 


Table 34. Production, in Millions of Bushels, of Grains
in the United States *

  Grain                       1921     1923     1925     1927     1929

  Corn                        3069     3054     2916     2763     2622
  Wheat                        815      797      677      878      807
  Oats                        1078     1306     1488     1183     1239
  Rye                           62       63       46       58       41
  Barley                       155      198      214      266      307
  Buckwheat                     14       14       14       16       12
  Rice                          38       34       33       45       40

  Totals                      5231     5466     5388     5209     5068

  Aggregative
  Relative

2.  Compute  the  harmonic  mean  of  the  relatives  for  the  data  of  Table  32. 

3.  Compare  the  five  index  numbers  that  have  been  computed  for  the 
data  of  Table  32. 

4.  Verify  the  production  relatives  given  in  Table  35.  For  this  table, 
compute  (1)  the  arithmetic  means,  (2)  the  medians,  (3)  the  geometric 
means  of  the  relatives  for  the  years  1923,  1925,  1927,  and  1929. 

* The data are taken from Statistical Abstract of the United States, 1930,
pp. 682-683.

Table 35. Production Relatives of Grains in the United States,
Based upon Table 34
(1921 = 100)

  Grain                       1921     1923     1925     1927     1929

  Corn                         100      100       95       90       85
  Wheat                        100       98       83      108       99
  Oats                         100      121      138      110      115
  Rye                          100      102       74       94       66
  Barley                       100      128      138      172      198
  Buckwheat                    100      100      100      114       86
  Rice                         100       89       87      118      105

50.  WEIGHTING 

In our previous discussion we attempted to regard all items as of
equal importance, although a concealed, "unconscious" weighting
was conceded. We admitted the existence of weights inherent in
the data themselves and called attention to the fact that doing
nothing about them may lead to illogical results. We shall now
consider how the illogical results may be somewhat eliminated by
the process of weighting. Weighting is the term used to describe the
conscious effort to assign to each commodity an influence that, in
the final result, is proportionate to its relative importance. The
index number that results from conscious weighting is called a
weighted index number. When no such conscious endeavor is made
and each item is permitted to exercise an influence upon the result
presumably equal to that of every other item, the index is said to
be unweighted or simple.

The weights are usually determined upon some rational basis such
as the quantities produced or consumed in a representative period, an
average of the quantities produced or consumed over several periods,
or some other criterion. As multipliers, it is obvious that the weights
may be abstract numbers, and thus that the weights may be numbers
proportional to the quantities produced or consumed. The fact that
actual quantity figures of production and consumption have become
increasingly available within recent decades has tended to encourage
their use as weights in index number construction. Two methods
of weighting by quantity figures are widely used: the first is called
"weighting by base period quantities," and the second is called
"weighting by given period quantities." A third method, that of
weighting by an average of the base period quantities and the given
period quantities, is growing in favor.

There  are  two  reasons  for  weighting  by  base  period  quantities. 
In  the  first  place,  despite  the  increasing  availability  of  quantity 
figures,  they  are  not  easily  obtained  for  many  commodities  for  the 
given  period.  In  the  second  place,  the  relative  variations  in  the 
quantities  from  period  to  period  are  frequently  not  sufficiently  large 
to  result  in  significant  errors  in  the  indexes  when  the  quantities  are 
assumed  constant  for  a  few  successive  periods. 

51.   WEIGHTED  AGGREGATES 

If we employ the quantities produced in the base period $q_0$ as
weights, the weighted aggregative relative index number of prices
at time "i" is given by

$${}_0P_i = \frac{\Sigma p_i q_0}{\Sigma p_0 q_0} \qquad (5a)$$

which is merely the ratio of the weighted aggregate at time "i" to
the total value in the base period. This is possibly our most widely
used index number.

We shall illustrate the use of this formula in Table 36 by
constructing the index number based upon the weighted aggregate of
actual prices in cents per bushel of grains in the United States for the
year 1925 with the year 1921 as the base year. The data are taken
from Tables 32 and 34.

If we employ the quantities produced in the given period $q_i$ as
weights, the weighted aggregative relative index number of prices
at time "i" is given by

$${}_0P_i = \frac{\Sigma p_i q_i}{\Sigma p_0 q_i} \qquad (5b)$$

We shall find that the weighted aggregative relatives, (5a) and
(5b), are basic formulas for the "Ideal" index given in Section 55.

To illustrate the construction of an index of weighted aggregates
based upon formula (5b), we request the student to complete Table
37. The data are taken from Tables 32 and 34.
Table 36. Index Number of Grain Prices in the United States for 1925
(1921 = base year)
Weighted aggregative method

Grain        Price 1921   Weight    Price 1921      Price 1925   Price 1925
                                    times Weight                 times Weight
             $p_0$        $q_0$     $p_0 q_0$       $p_i$        $p_i q_0$

Corn           42.3        3069     129 818.7         67.4       206 850.6
Wheat          92.6         815      75 469.0        141.6       115 404.0
Oats           30.2        1078      32 555.6         38.0        40 964.0
Rye            69.7          62       4 321.4         78.2         4 848.4
Barley         41.9         155       6 494.5         58.8         9 114.0
Buckwheat      81.2          14       1 136.8         88.8         1 243.2
Rice           95.2          38       3 617.6        153.8         5 844.4

Totals                              253 413.6                    384 268.6

Price 1921 is in cents per bushel.
Price 1925 is in cents per bushel.
Weights are quantities produced in 1921 in millions of bushels.

$\sum p_0 q_0 = 253\,413.6 \qquad \sum p_i q_0 = 384\,268.6$

$${}_0P_i = \frac{\sum p_i q_0}{\sum p_0 q_0} = \frac{384\,268.6}{253\,413.6} = 151.6 \text{ per cent}$$
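The arithmetic of Table 36 can be checked mechanically. The following is a minimal sketch in Python, not part of the original text; the function name `weighted_aggregative_index` and the variable names are illustrative assumptions, while the figures are those of Table 36.

```python
# Table 36 data: prices in cents per bushel, 1921 production in millions of bushels.
p0 = [42.3, 92.6, 30.2, 69.7, 41.9, 81.2, 95.2]    # price 1921 (base year)
pi = [67.4, 141.6, 38.0, 78.2, 58.8, 88.8, 153.8]  # price 1925 (given year)
q0 = [3069, 815, 1078, 62, 155, 14, 38]            # quantity 1921 (weights)


def weighted_aggregative_index(p_given, p_base, q):
    """Return 100 * sum(p_given * q) / sum(p_base * q)."""
    numerator = sum(p * w for p, w in zip(p_given, q))
    denominator = sum(p * w for p, w in zip(p_base, q))
    return 100.0 * numerator / denominator


index_5a = weighted_aggregative_index(pi, p0, q0)
print(round(index_5a, 1))  # 151.6
```

The same function serves for any other set of fixed weights; only the list passed as `q` changes.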


Table 37. Production and Price of Grains in the United States in 1921 and 1925

Grain        Price (cents)       Production                $p_0 q_i$    $p_i q_i$
                                 (millions of bushels)
             1921     1925       1921      1925
             $p_0$    $p_i$      $q_0$     $q_i$

Corn         42.3     67.4       3069      2916
Wheat        92.6    141.6        815       677
Oats         30.2     38.0       1078      1488
Rye          69.7     78.2         62        46
Barley       41.9     58.8        155       214
Buckwheat    81.2     88.8         14        14
Rice         95.2    153.8         38        33

Totals
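A student who has completed Table 37 by hand may wish to verify the two product columns and the resulting index. The sketch below is one way to do so in Python; it is an illustration only, and the variable names are assumptions rather than notation from the text.

```python
# Table 37 data: prices in cents per bushel, production in millions of bushels.
p0 = [42.3, 92.6, 30.2, 69.7, 41.9, 81.2, 95.2]    # price 1921
pi = [67.4, 141.6, 38.0, 78.2, 58.8, 88.8, 153.8]  # price 1925
qi = [2916, 677, 1488, 46, 214, 14, 33]            # production 1925 (weights)

# The two columns left blank in Table 37.
p0_qi = [a * q for a, q in zip(p0, qi)]
pi_qi = [b * q for b, q in zip(pi, qi)]

total_p0_qi = sum(p0_qi)
total_pi_qi = sum(pi_qi)

# Formula (5b): weighted aggregate with given-period quantities as weights.
index_5b = 100.0 * total_pi_qi / total_p0_qi
print(round(total_p0_qi, 1), round(total_pi_qi, 1), round(index_5b, 1))
```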
EXERCISES

1.  (Lovitt  and  Holtzclaw.)  The  following  table  gives  the  average  price 
and  weights  (quantity  used  per  year  by  the  average  workingman's  family) 
of  several  important  items  of  food.  Using  1913  as  the  base  year,  compute 

(a)  the  simple  aggregative  relative  of  prices  for  1915,  1918,  1920  and 
1922, 

(b)  the  simple  arithmetic  mean  of  relatives  for  1915,  1918,  1920  and 
1922, 

(c)  the  weighted  aggregative  relative  index  for  the  years  1920  and  1922. 


Commodity       Unit       Weights   Prices
                                     1913     1915     1918     1920     1922

Sirloin Steak   lb.           15     $0.254   $0.257   $0.389   $0.437   $0.374
Round Steak     lb.           40      .223     .230     .369     .395     .323
Bacon           lb.           13      .270     .269     .529     .523     .398
Eggs            doz.          70      .345     .341     .569     .681     .444
Butter          lb.           76      .383     .358     .577     .701     .479
Milk            qt.          424      .089     .088     .139     .167     .131
Flour           1/8 bbl.       8      .809    1.029    1.642    1.985    1.250
Potatoes        peck          50      .255     .225     .480     .945     .420
Sugar           lb.          145      .055     .066     .097     .194     .073

2.  The  following  were  the  retail  prices  of  some  foods  during  1926  and 
1934: 


Commodity     Round Steak   Potatoes   Beans    Butter   Coffee   Flour
Unit          lb.           lb.        lb.      lb.      lb.      lb.
Price 1926    $0.36         $0.05      $0.09    $0.58    $0.51    $0.06
Price 1934     .28           .02        .06      .31      .29      .05

Compute the simple indexes to fill in the blanks below, and criticize them.

                      1926 = Base Year      1934 = Base Year
                      1926      1934        1926      1934

A.M. of relatives     100       ____        ____      100
G.M. of relatives     100       ____        ____      100
Aggre. relative       100       ____        ____      100
52.   WEIGHTED  AVERAGES  OF  RELATIVES

Weighted  index  numbers  may  also  be  computed  from  weighted 
averages  of  relatives.  The  averages  most  widely  used  are  the  arith- 
metic mean  and  the  geometric  mean.  The  formulas  for  the  weighted 
arithmetic  and  the  weighted  geometric  means  are  derived  immediately 
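from those given on pages 61 and 90 by simply considering the frequencies as weights. Thus, if $X_i$ is any value and $w_i$ its weight, we have

$$M = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wX}{\sum w} = \frac{\sum wX}{N} \qquad (6)$$

for the weighted arithmetic mean, and

$$M_g = \sqrt[N]{X_1^{w_1} X_2^{w_2} \cdots X_n^{w_n}} = \sqrt[N]{\prod X^{w}} \qquad (7)$$

for the weighted geometric mean, where

$$N = w_1 + w_2 + \cdots + w_n$$

Written logarithmically, formula (7) becomes

$$\log M_g = \frac{w_1 \log X_1 + w_2 \log X_2 + \cdots + w_n \log X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w \log X}{\sum w} = \frac{\sum w \log X}{N} \qquad (8)$$

The weighted harmonic mean is given by

$$M_h = \frac{w_1 + w_2 + \cdots + w_n}{\dfrac{w_1}{X_1} + \dfrac{w_2}{X_2} + \cdots + \dfrac{w_n}{X_n}} = \frac{\sum w}{\sum \dfrac{w}{X}} = \frac{N}{\sum \dfrac{w}{X}}$$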
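The three weighted means translate directly into short routines. The following Python sketch is offered only as an illustration; the function names and the sample values and weights are hypothetical and do not come from the text.

```python
import math


def weighted_arithmetic_mean(values, weights):
    """sum(w * X) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def weighted_geometric_mean(values, weights):
    """Antilogarithm of sum(w * log X) / sum(w), as in formula (8)."""
    log_mean = sum(w * math.log10(x) for x, w in zip(values, weights)) / sum(weights)
    return 10 ** log_mean


def weighted_harmonic_mean(values, weights):
    """sum(w) / sum(w / X)."""
    return sum(weights) / sum(w / x for x, w in zip(values, weights))


# Hypothetical values and weights, chosen only for illustration.
values = [2.0, 8.0]
weights = [3.0, 1.0]

a = weighted_arithmetic_mean(values, weights)
g = weighted_geometric_mean(values, weights)
h = weighted_harmonic_mean(values, weights)
print(a, g, h)
```

For these sample figures the harmonic mean is the smallest of the three and the arithmetic mean the largest.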

It should be emphasized that "in weighting individual price relatives, quantities will not serve. The abstract relatives must be weighted by values if the resulting products are to be comparable. For values are in terms of a common dollar unit, while quantities may be expressed in a variety of units."¹

A. The Weighted Arithmetic Mean of Relatives. An index of this type may be obtained in several ways. We may weight each relative by base-period values, by given period values, or by an average of the base period values and the given period values. Weighting by base period values is the method most widely used.

¹ Frederick C. Mills, Statistical Methods, Revised, 1938, p. 195.

To compute a weighted arithmetic mean of relatives for time "i," weighting by base period values, we multiply each relative $p_i/p_0$ by the value $p_0 q_0$ of the corresponding commodity in the base period, and express the sum of the products as a relative of the total value in the base period.

We  shall  illustrate  the  computation  of  this  type  of  index  in  Table 
38  by  constructing  the  index  of  the  prices  of  grains  in  the  United 
States  in  the  year  1925  with  the  year  1921  as  the  base  year.  The 
data  are  taken  from  Tables  33  and  36. 


Table 38. Index Number of Grain Prices in the United States for 1925
(1921 = base year)
Weighted arithmetic mean of relatives

Grain        Relative      Relative      Weight        Relative Price 1925
             Price 1921    Price 1925    $p_0 q_0$     times Weight
                                                       (3) X (4)
(1)          (2)           (3)           (4)           (5)

Corn         100           159           129 818.7     20 641 173.3
Wheat        100           153            75 469.0     11 546 757.0
Oats         100           126            32 565.6      4 103 265.6
Rye          100           112             4 321.4        483 996.8
Barley       100           140             6 494.5        909 230.0
Buckwheat    100           109             1 136.8        123 911.2
Rice         100           162             3 617.6        586 051.2

Totals                                   253 423.6     38 394 385.1

The relative prices 1921 and 1925 were taken from Table 33. The weights, values of the respective grains in 1921, were taken from Table 36.

$\sum (\text{price 1925} \times \text{weight}) = 38\,394\,385.1$

$\sum \text{weight} = 253\,423.6$

$${}_0P_i = \frac{38\,394\,385.1}{253\,423.6} = 151.6$$

which is the same index as that secured from the computations of Table 36.
The equality of values for the indexes secured by the two methods illustrated in Tables 36 and 38 is not a coincidence, for the weighted arithmetic mean of relative prices, weighted by the values in the base year, is always equal to the relative of aggregates weighted by base year quantities. For

$$\frac{\dfrac{p_i'}{p_0'} \times p_0' q_0' + \dfrac{p_i''}{p_0''} \times p_0'' q_0'' + \cdots + \dfrac{p_i^{(n)}}{p_0^{(n)}} \times p_0^{(n)} q_0^{(n)}}{p_0' q_0' + p_0'' q_0'' + \cdots + p_0^{(n)} q_0^{(n)}} = \frac{\sum p_i q_0}{\sum p_0 q_0}$$

or, more briefly,

$$\frac{\sum \dfrac{p_i}{p_0} \, p_0 q_0}{\sum p_0 q_0} = \frac{\sum p_i q_0}{\sum p_0 q_0}$$

which was given in (5a).
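The identity can also be confirmed numerically. In the sketch below (an illustration in Python, with variable names that are assumptions and not the book's notation) the relatives are carried at full precision rather than rounded to whole per cents, so that the two sides agree except for floating-point error.

```python
import math

# Grain data of Table 36: prices in cents per bushel, 1921 quantities in millions of bushels.
p0 = [42.3, 92.6, 30.2, 69.7, 41.9, 81.2, 95.2]    # price 1921
pi = [67.4, 141.6, 38.0, 78.2, 58.8, 88.8, 153.8]  # price 1925
q0 = [3069, 815, 1078, 62, 155, 14, 38]            # quantity 1921

relatives = [b / a for a, b in zip(p0, pi)]        # unrounded relatives p_i / p_0
base_values = [a * q for a, q in zip(p0, q0)]      # weights p_0 * q_0

# Left side: arithmetic mean of relatives weighted by base-year values.
mean_of_relatives = sum(r * v for r, v in zip(relatives, base_values)) / sum(base_values)

# Right side: relative of aggregates weighted by base-year quantities.
ratio_of_aggregates = sum(b * q for b, q in zip(pi, q0)) / sum(base_values)

print(round(100 * mean_of_relatives, 1), round(100 * ratio_of_aggregates, 1))
print(math.isclose(mean_of_relatives, ratio_of_aggregates))
```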

The arithmetical computations of Table 38 could have been considerably reduced by replacing the weights $p_0 q_0$ in column (4) by the numbers 130, 75, 33, 4, 6, 1, 4 to which the weights are approximately proportional (see Theorem II, p. 9). We shall leave it as an exercise for the student to show that the result is

$${}_0P_i = 38\,348/253 = 151.6$$
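The shortened computation takes only a few lines to reproduce. This is a minimal sketch in Python, given for illustration; the variable names are assumptions, and the relatives and proportional weights are those quoted in the text.

```python
# Relative prices for 1925 (1921 = 100), as used in Table 38.
relatives_1925 = [159, 153, 126, 112, 140, 109, 162]

# Small whole numbers approximately proportional to the base-year values p0*q0.
weights = [130, 75, 33, 4, 6, 1, 4]

weighted_sum = sum(r * w for r, w in zip(relatives_1925, weights))
weight_total = sum(weights)
index = weighted_sum / weight_total

print(weighted_sum, weight_total, round(index, 1))  # 38348 253 151.6
```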

In the construction of Table 38, we made use of relatives and values that previously had been computed from the original data and recorded in Tables 33 and 36. Generally, one is called upon to construct the index from the original data, and we suggest the following form for the work-sheet when the weights are the values of the respective commodities for the base year. Of course if the weights are numbers proportional to the values, columns (7) and (8) can be changed accordingly. For the greatest simplicity in computation, the weights should be expressed as percentages of $\sum p_0 q_0$. This will mean that the sum of the weights is 100, and the consequent division can be performed mentally.

It is evident that the numbers in column (8), which are derived by multiplying columns (6) and (7), will give the actual values $p_i q_0$ only when the relatives given in column (6) are accurate. Since the weights may be numbers proportional to $p_0 q_0$, column (8) should always be found by multiplying (6) and (7) and not by multiplying $p_i$ and $q_0$.
Form for Computing Index Numbers
Weighted arithmetic mean of relatives method
Weights: Base year values

Commodity        Unit   Price        Price        Quantity     Relative Price       Weight            Product of Weight
                        Base Year    Given Year   Base Year    Given Year                             and Relative Price
                        $p_0$        $p_i$        $q_0$        $p_i/p_0$            $p_0 q_0$
(1)              (2)    (3)          (4)          (5)          (6)                  (7)               (8)

1st Commodity           $p_0'$       $p_i'$       $q_0'$       $p_i'/p_0'$          $p_0' q_0'$       $p_i' q_0'$
2nd Commodity           $p_0''$      $p_i''$      $q_0''$      $p_i''/p_0''$        $p_0'' q_0''$     $p_i'' q_0''$
. . .
nth Commodity           $p_0^{(n)}$  $p_i^{(n)}$  $q_0^{(n)}$  $p_i^{(n)}/p_0^{(n)}$  $p_0^{(n)} q_0^{(n)}$  $p_i^{(n)} q_0^{(n)}$

Totals                                                                              $\sum p_0 q_0$    $\sum p_i q_0$

Description of data:
Prices base year are in ____ units.
Prices given year are in ____ units.

Index number $= {}_0P_i = \dfrac{\sum \dfrac{p_i}{p_0} \, p_0 q_0}{\sum p_0 q_0}$

B.  Weighted  Geometric  Mean  of  Relatives.  A  verbal  interpre- 
tation of  formula  (8)  will  point  out  the  steps  to  be  taken  in  con- 
structing the  weighted  geometric  mean  of  relatives.  The  steps  are 
as  follows: 

1.  Compute the relatives for the period "i" for which the index is being constructed.

2.  Find the logarithm of each relative.

3.  Multiply each logarithm by the given weight.

4.  Add the results obtained in Step 3.

5.  Divide the total obtained in Step 4 by the sum of the weights. This gives $\log M_g$.

6.  Find  the  antilogarithm  of  the  quantity  obtained  in  Step  5.   This  is 
the  weighted  geometric  mean  of  the  relatives. 

We  shall  illustrate  the  computation  of  this  type  of  index  in  Table 
39  by  constructing  the  index  of  the  prices  of  grains  in  the  United 
States  in  the  year  1925  with  the  year  1921  as  the  base  year.  The 
relatives  have  been  computed  in  Table  33.   We  shall  use  as  weights 
the  numbers  130,  75,  33,  4,  6,  1,  4  which  are  proportional  to  the
actual  values,  given  in  Table  36,  of  the  commodities  in  the  base 
year. 


Table 39. Index Number of Grain Prices in the United States for 1925
(1921 = base year)

Grain        Relative      Logarithm of the    Weight    Logarithm
             Price 1925    Relative Price                times Weight
                                                         (3) X (4)
(1)          (2)           (3)                 (4)       (5)

Corn         159           2.20 140            130       286.18 200
Wheat        153           2.18 469             75       163.85 175
Oats         126           2.10 037             33        69.31 221
Rye          112           2.04 922              4         8.19 688
Barley       140           2.14 613              6        12.87 678
Buckwheat    109           2.03 743              1         2.03 743
Rice         162           2.20 952              4         8.83 808

Totals                                         253       551.29 513

Relative prices for 1925 were taken from Table 33. The weights are numbers proportional to actual values of the commodities produced in the base year. They were taken from Table 36.

$$\log M_g = \frac{551.29\,513}{253} = 2.17\,903$$

$$M_g = 151.2$$
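The six steps listed for the weighted geometric mean can be carried out mechanically. The sketch below is an illustration in Python using the relatives and proportional weights of Table 39; the variable names are assumptions, and `math.log10` stands in for the table of common logarithms.

```python
import math

# Step 1: relatives for 1925 (1921 = 100), as used in Table 39.
relatives = [159, 153, 126, 112, 140, 109, 162]
weights = [130, 75, 33, 4, 6, 1, 4]

# Step 2: logarithm of each relative.
logs = [math.log10(r) for r in relatives]

# Step 3: each logarithm times its weight.
weighted_logs = [w * lg for w, lg in zip(weights, logs)]

# Step 4: add the products.  Step 5: divide by the sum of the weights.
log_mg = sum(weighted_logs) / sum(weights)

# Step 6: antilogarithm.
mg = 10 ** log_mg

print(round(log_mg, 5), round(mg, 1))
```

Carried at full machine precision, the quotient agrees with the 2.17903 of the text, while the antilogarithm comes out near 151.0 rather than the printed 151.2; the small gap may reflect table interpolation or rounding in the original work.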

The  student  will  note  that  the  three  index  numbers  we  have 
computed  for  these  data  on  grains  show  a  slight  variation.  The 
methods  used  in  Tables  36  and  38  result  in  an  index  of  151.6,  whereas 
Table  37  gives  150.1  and  Table  39  gives  151.2.  To  judge  the  relative 
merits  of  these  indexes,  we  shall  consider  certain  tests  in  Sections  54 
and  55. 

In  the  construction  of  Table  39,  we  made  use  of  computations 
that  previously  had  been  made  upon  the  original  data.  Generally, 
one  is  required  to  construct  an  index  from  the  original  data,  and  we 
suggest  the  following  arrangement  for  the  worksheet  when  computing 
an  index  from  original  data  by  means  of  the  geometric  mean  of  the 
relatives. 


[Work-sheet form for the weighted geometric mean of relatives; column headings as recovered]

(1)  Commodity
(2)  Unit
(3)  Price Base Year, $p_0$
(4)  Price Given Year, $p_i$
(5)  Quantity Base Year, $q_0$
(6)  Value Base Year, $p_0 q_0$
(7)  Relative Price Given Year, $p_i/p_0$
(8)  Weight, $w$
(9)  Logarithm of Relative Price Given Year, $\log (p_i/p_0)$
(10) Logarithm of Relative Price Given Year times Weight, $w \log (p_i/p_0)$

Rows: 1st Commodity, 2nd Commodity, . . . , nth Commodity, Totals.

$$\log M_g = \frac{\sum w \log (p_i/p_0)}{\sum w}$$
53.   SUMMARY  AND  EXTENSION

In our treatment of the index numbers of prices, we have given consideration to the following types, to some of which we have devoted considerable attention. In addition to the median of the simple relatives, we have considered the following:

1.  Simple aggregative relative: $\dfrac{\sum p_i}{\sum p_0}$

2.  Simple arithmetic mean of relatives: $\dfrac{1}{N} \sum \dfrac{p_i}{p_0}$

3.  Simple geometric mean of relatives: $\sqrt[N]{\prod \dfrac{p_i}{p_0}}$

4.  Simple harmonic mean of relatives: $\dfrac{N}{\sum \dfrac{p_0}{p_i}}$

5.  Weighted aggregative relative (weights = base period quantities): $\dfrac{\sum p_i q_0}{\sum p_0 q_0}$

6.  Weighted aggregative relative (weights = given period quantities): $\dfrac{\sum p_i q_i}{\sum p_0 q_i}$

7.  Weighted arithmetic mean of relatives (weights = base period values): $\dfrac{\sum (p_0 q_0) \dfrac{p_i}{p_0}}{\sum p_0 q_0}$

8.  Weighted geometric mean of relatives (weights = base period values): $\log M_g = \dfrac{\sum (p_0 q_0) \log \dfrac{p_i}{p_0}}{\sum p_0 q_0}$

9.  Weighted harmonic mean of relatives (weights = base period values): $\dfrac{\sum p_0 q_0}{\sum (p_0 q_0) \dfrac{p_0}{p_i}}$

Other useful types may be developed by devising different systems of weights. Suppose we weight the base period prices $p_0$ by base period quantities $q_0$, and the given period prices $p_i$ by the given period quantities $q_i$. The ratio of the aggregate value in the given period to the aggregate value in the base period gives the value index ${}_0V_i$. We thus have

10.  Weighted aggregative relative. Value index: ${}_0V_i = \dfrac{\sum p_i q_i}{\sum p_0 q_0}$
(weights base period = base period quantities)
(weights given period = given period quantities)

Other aggregative relatives may be obtained by choosing as weights averages of the base period quantities $q_0$ and the given period quantities $q_i$. Thus we may choose as weights

$$\frac{q_0 + q_i}{2}, \qquad \sqrt{q_0 q_i}, \qquad \frac{2 q_0 q_i}{q_0 + q_i}$$

which are respectively the arithmetic mean, the geometric mean, and the harmonic mean of $q_0$ and $q_i$. Employing these weights, we have the additional aggregative indexes.

11.  Weighted aggregative relative (weights = $(q_0 + q_i)/2$): $\dfrac{\sum p_i \dfrac{q_0 + q_i}{2}}{\sum p_0 \dfrac{q_0 + q_i}{2}} = \dfrac{\sum p_i (q_0 + q_i)}{\sum p_0 (q_0 + q_i)}$

12.  Weighted aggregative relative (weights = $\sqrt{q_0 q_i}$): $\dfrac{\sum p_i \sqrt{q_0 q_i}}{\sum p_0 \sqrt{q_0 q_i}}$

13.  Weighted aggregative relative (weights = $2 q_0 q_i/(q_0 + q_i)$): $\dfrac{\sum p_i \dfrac{2 q_0 q_i}{q_0 + q_i}}{\sum p_0 \dfrac{2 q_0 q_i}{q_0 + q_i}} = \dfrac{\sum p_i \dfrac{q_0 q_i}{q_0 + q_i}}{\sum p_0 \dfrac{q_0 q_i}{q_0 + q_i}}$

The formulas listed in 10, 11, 12 and 13 above take into account not only the varying prices but the varying quantities as well. They have the disadvantage of requiring the quantities $q_i$ at time "i," which are not always available. The formula listed in 11, namely,

$${}_0P_i = \frac{\sum p_i (q_0 + q_i)}{\sum p_0 (q_0 + q_i)}$$

is Fisher's formula 2153, which has met wide approval.¹ Because of its simplicity and the facility of its computation, Professor Fisher has proposed its use as a substitute for his "Ideal" index (see page 198).
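To see how little the choice among these averaged weights matters for the grain data, one can evaluate formulas 10 through 13 side by side. The following Python sketch is an illustration only; the helper name `aggregative` and the variable names are assumptions, and the data are those of Table 37.

```python
import math

# Table 37 data: prices in cents per bushel, production in millions of bushels.
p0 = [42.3, 92.6, 30.2, 69.7, 41.9, 81.2, 95.2]    # price 1921
pi = [67.4, 141.6, 38.0, 78.2, 58.8, 88.8, 153.8]  # price 1925
q0 = [3069, 815, 1078, 62, 155, 14, 38]            # production 1921
qi = [2916, 677, 1488, 46, 214, 14, 33]            # production 1925


def aggregative(p_given, p_base, w):
    """Weighted aggregative relative, 100 * sum(p_given*w) / sum(p_base*w)."""
    return 100.0 * sum(p * x for p, x in zip(p_given, w)) / sum(p * x for p, x in zip(p_base, w))


# Formula 10: value index (each period's prices weighted by its own quantities).
value_index = 100.0 * sum(p * q for p, q in zip(pi, qi)) / sum(p * q for p, q in zip(p0, q0))

# Formulas 11, 12, 13: weights are the arithmetic, geometric and harmonic means of q0 and qi.
w_arith = [(a + b) / 2 for a, b in zip(q0, qi)]
w_geom = [math.sqrt(a * b) for a, b in zip(q0, qi)]
w_harm = [2 * a * b / (a + b) for a, b in zip(q0, qi)]

index_11 = aggregative(pi, p0, w_arith)
index_12 = aggregative(pi, p0, w_geom)
index_13 = aggregative(pi, p0, w_harm)

print(round(value_index, 1), round(index_11, 1), round(index_12, 1), round(index_13, 1))
```

The value index differs in kind from the other three, since it reflects the change in quantities as well as the change in prices.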

In a similar manner we may construct other index numbers that are weighted averages of relatives by devising various systems of weights. In Section 52, we recommended the use of values as weights for the abstract relatives. Professor Fisher has outlined the following methods of weighting by values.²

¹ Irving Fisher, The Making of Index Numbers, 1927, p. 284.
² Irving Fisher, op. cit., p. 64.
I.  Each weight = base period price X base period quantity: $p_0 q_0$

II.  Each weight = base period price X given period quantity: $p_0 q_i$

III.  Each weight = given period price X base period quantity: $p_i q_0$

IV.  Each weight = given period price X given period quantity: $p_i q_i$

We have previously used $p_0 q_0$ as weights for the relatives $p_i/p_0$ in deriving the arithmetic mean of relatives given by 7, the geometric mean of relatives given by 8, and the harmonic mean of relatives given by 9. Let us now use the values $p_i q_i$ of the given period as weights. We have

14.  Weighted arithmetic mean of relatives (weights = given period values $p_i q_i$): $\dfrac{\sum p_i q_i \dfrac{p_i}{p_0}}{\sum p_i q_i}$

15.  Weighted geometric mean of relatives (weights = given period values $p_i q_i$): $\log M_g = \dfrac{\sum p_i q_i \log \dfrac{p_i}{p_0}}{\sum p_i q_i}$

16.  Weighted harmonic mean of relatives (weights = given period values $p_i q_i$): $M_h = \dfrac{\sum p_i q_i}{\sum p_i q_i \dfrac{p_0}{p_i}} = \dfrac{\sum p_i q_i}{\sum p_0 q_i}$

EXERCISES 


1.  Compute  the  value  index  for  grains  —  formula  10,  Section  53  —  for 
the  year  1925.   The  data  are  given  in  Table  37. 

2.  Compute  the  weighted  aggregative  relative  for  the  prices  of  grains 
by  formula  11,  Section  53.   The  data  are  given  in  Table  37. 


3.  Table 40. Production and Farm Price of the Principal Farm Crops in the United States

Crop             Unit            Production (millions)      Unit Price (dollars)
                                 1913    1921    1929       1913     1921     1929

Corn             bu.             2447    3069    2622        0.69     0.42     0.78
Wheat            bu.              763     815     807        0.80     0.93     1.04
Oats             bu.             1121    1078    1239        0.39     0.30     0.44
Barley           bu.              178     155     307        0.54     0.42     0.55
Rice             bu.               26      38      40        0.86     0.95     0.98
Potatoes         bu.              332     362     357        0.69     0.92     1.31
Apples           bu.              145      99     143        0.98     1.68     1.32
Sweet Potatoes   bu.               59      99      85        0.73     0.88     0.95
Cotton           bale 500 lbs.     14       8      15       61.00    81.00    82.00
Tobacco          lb.              954    1070    1501        0.13     0.20     0.10
Hay              ton               64      82     102       12.43    12.10    12.23

Using 1913 as the base year, compute, for Table 40, indexes for the years 1921 and 1929:

(1)  by  formula  5,  Section  53 

(2)  by  formula  6,  Section  53 

(3)  by  formula  11,  Section  53 

(4)  by  formula  10,  Section  53 

(5)  by  formula  8,  Section  53 

(6)  by  formula  14,  Section  53 

54.  BIAS 

It  is  a  well-known  theorem  of  algebra  that  if  A  is  the  arithmetic 
mean, $G$ the geometric mean, and $H$ the harmonic mean of a set of numbers, then $H < G < A$.¹ That is, the simple harmonic mean is
less  than  the  simple  geometric  mean  which,  in  turn,  is  less  than  the 
simple  arithmetic  mean.  In  averaging  a  group  of  simple  relatives, 
the  arithmetic  mean  tends  to  give  a  value  too  large  and  the  harmonic 
mean  a  value  too  small  to  be  a  fair  representation  of  the  facts.  In 
more  technical  language,  the  arithmetic  mean  is  said  to  have  an 
upward  bias  and  the  harmonic  mean  a  downward  bias.  In  contrast 
with  the  weight  bias,  to  be  considered  later,  the  bias  arising  from  the 
form  of  average  used  is  called  type  bias. 

The  existence  of  bias  in  the  simple  arithmetic  and  the  simple 
harmonic  means  can  be  explained  in  another  manner,  namely, 
through  the  use  of  the  time  reversal  test.  The  time  reversal  test 
requires  that  the  product  of  the  index  for  any  given  period  on  the 
base  period  and  the  index  for  the  base  period  on  the  given  period 
should equal unity. In symbols, the time reversal test requires that

$${}_0P_i \times {}_iP_0 = 1$$

It  is  very  easy  to  show  that  the  simple  geometric  mean  of  relatives 
satisfies  this  test  and  that  the  simple  arithmetic  and  the  simple 
harmonic  means  of  relatives  do  not  satisfy  it.  The  simple  geometric 
mean  is  thus  without  type  bias.  It  has  been  observed  by  the  makers 
of  index  numbers  that,  when  the  simple  arithmetic  mean  and  the 
simple  harmonic  mean  are  crossed  (averaged)  geometrically,  the  bias 
is considerably reduced. The fact that the simple geometric mean is without type bias means that the index number obtained as a simple geometric mean of relatives is independent of the period taken as a base. These facts give the geometric mean remarkable merit in index number construction.

¹ For a proof, see Robert W. Burgess, Introduction to the Mathematics of Statistics, 1927, p. 101.
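The behavior of the three simple means under the time reversal test can be observed directly. The sketch below, in Python, is an illustration only; the function names are assumptions, and the prices are the 1921 and 1925 grain prices used earlier in the chapter.

```python
import math

p0 = [42.3, 92.6, 30.2, 69.7, 41.9, 81.2, 95.2]    # grain prices 1921
pi = [67.4, 141.6, 38.0, 78.2, 58.8, 88.8, 153.8]  # grain prices 1925


def arithmetic_mean(x):
    return sum(x) / len(x)


def geometric_mean(x):
    return math.exp(sum(math.log(v) for v in x) / len(x))


def harmonic_mean(x):
    return len(x) / sum(1.0 / v for v in x)


def crossed(x):
    """Geometric cross of the arithmetic and harmonic means."""
    return math.sqrt(arithmetic_mean(x) * harmonic_mean(x))


forward = [b / a for a, b in zip(p0, pi)]    # relatives of 1925 on 1921
backward = [a / b for a, b in zip(p0, pi)]   # relatives of 1921 on 1925

# Time reversal test: forward index times backward index should equal unity.
products = {
    "arithmetic": arithmetic_mean(forward) * arithmetic_mean(backward),
    "geometric": geometric_mean(forward) * geometric_mean(backward),
    "harmonic": harmonic_mean(forward) * harmonic_mean(backward),
    "crossed": crossed(forward) * crossed(backward),
}
for name, value in products.items():
    print(name, round(value, 4))
```

The arithmetic product exceeds unity and the harmonic product falls short of it, while the geometric mean and the crossed index return unity apart from rounding.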

When  weights  are  applied  in  the  construction  of  index  numbers, 
another  kind  of  bias  —  weight  bias  —  appears.  Each  system  of 
weighting  has  its  bias.  Weighting  the  relatives  by  base  period  values 
$p_0 q_0$ produces downward bias while weighting the relatives by given period values $p_i q_i$ produces upward bias. The weighted arithmetic
mean  and  the  weighted  harmonic  mean  of  relatives  may  have  both 
type  bias  and  weight  bias.  If  the  base  period  values  are  employed 
in  the  construction  of  the  weighted  arithmetic  mean  of  relatives,  the 
net  bias  will  likely  be  small.  Similarly,  if  given  period  values  are 
employed  as  weights  in  the  construction  of  the  weighted  harmonic 
mean  of  relatives,  the  net  bias  will  likely  be  small.  Further,  as  the 
net  bias  of  the  arithmetic  mean  of  relatives,  weighted  by  base  period 
values,  has  been  observed  to  be  in  the  opposite  direction  to  the  net 
bias  of  the  harmonic  mean  of  relatives,  weighted  by  given  period 
values,  crossing  these  two  indexes  geometrically  should  produce  an 
index  practically  free  from  bias. 


55.  FISHER'S  IDEAL  INDEX

Let

$A$ = weighted arithmetic mean of relatives (weights = base period values $p_0 q_0$): $\dfrac{\sum p_i q_0}{\sum p_0 q_0}$

and

$H$ = weighted harmonic mean of relatives (weights = given period values): $\dfrac{\sum p_i q_i}{\sum p_0 q_i}$

The geometric mean of $A$ and $H$,

$${}_0P_i = \sqrt{AH} = \sqrt{\frac{\sum p_i q_0}{\sum p_0 q_0} \times \frac{\sum p_i q_i}{\sum p_0 q_i}}$$

is known as Fisher's Ideal Index Number.¹ This index is not only the geometric mean of a weighted arithmetic mean and a weighted harmonic mean of relatives; it is clearly a geometric mean of two aggregative relatives. The formula requires both price and quantity data for each period to which the index applies. Since the data for quantities are frequently difficult to secure, the practical usefulness of the Ideal index is to some extent limited.

¹ Irving Fisher, op. cit., p. 220.

Interchanging 0 and i throughout the formula, we have

$${}_iP_0 = \sqrt{\frac{\sum p_0 q_i}{\sum p_i q_i} \times \frac{\sum p_0 q_0}{\sum p_i q_0}}$$

Evidently ${}_0P_i \cdot {}_iP_0$ equals unity, and hence the Ideal formula satisfies the requirements of the time reversal test.

A second test of validity, a test strongly recommended by Professor Fisher, is the factor reversal test. This test, states Professor Fisher, "ought to permit interchanging prices and quantities without giving inconsistent results; i.e., the two results multiplied together should give the true value ratio."¹

¹ Irving Fisher, op. cit., p. 72.

If in the Ideal formula we replace every $p$ by a $q$ and every $q$ by a $p$, we have

$${}_0Q_i = \sqrt{\frac{\sum q_i p_0}{\sum q_0 p_0} \times \frac{\sum q_i p_i}{\sum q_0 p_i}}$$

Multiplying ${}_0P_i$ and ${}_0Q_i$ together, we have

$${}_0P_i \cdot {}_0Q_i = \frac{\sum p_i q_i}{\sum p_0 q_0}$$

which is the true value index. Consequently, the Ideal formula meets completely the factor reversal test. This means, of course, that the formula serves equally well for constructing indexes of quantities as for constructing indexes of prices, the quantity index being derived by interchanging $p$ and $q$ in the Ideal formula for ${}_0P_i$.

None of the simple or weighted forms of the elementary indexes (arithmetic mean, harmonic mean, geometric mean) fulfill the requirements of the factor reversal test. It is thus obvious that the strong restrictions imposed by the factor reversal test compel its being ignored in the construction of many highly reputable index numbers.
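Both reversal tests can be checked numerically for the Ideal formula. The following Python sketch is an illustration only; the helper name `aggregative` and the variable names are assumptions, and the data are the grain prices and quantities of Table 37.

```python
import math

# Table 37 data: prices in cents per bushel, production in millions of bushels.
p0 = [42.3, 92.6, 30.2, 69.7, 41.9, 81.2, 95.2]    # price 1921
pi = [67.4, 141.6, 38.0, 78.2, 58.8, 88.8, 153.8]  # price 1925
q0 = [3069, 815, 1078, 62, 155, 14, 38]            # production 1921
qi = [2916, 677, 1488, 46, 214, 14, 33]            # production 1925


def aggregative(given, base, w):
    """Ratio of weighted aggregates, sum(given*w) / sum(base*w)."""
    return sum(g * x for g, x in zip(given, w)) / sum(b * x for b, x in zip(base, w))


# Ideal price index: geometric mean of the two weighted aggregative relatives.
ideal_price = math.sqrt(aggregative(pi, p0, q0) * aggregative(pi, p0, qi))

# Time reversal: interchange the subscripts 0 and i throughout.
ideal_price_reversed = math.sqrt(aggregative(p0, pi, qi) * aggregative(p0, pi, q0))

# Factor reversal: interchange p and q to obtain the quantity index.
ideal_quantity = math.sqrt(aggregative(qi, q0, p0) * aggregative(qi, q0, pi))

value_ratio = sum(p * q for p, q in zip(pi, qi)) / sum(p * q for p, q in zip(p0, q0))

print(round(100 * ideal_price, 1))
print(math.isclose(ideal_price * ideal_price_reversed, 1.0))
print(math.isclose(ideal_price * ideal_quantity, value_ratio))
```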
CONCLUSION

In  our  treatment  of  index  numbers,  we  have  not  attempted  to  do 
more  than  touch  upon  the  important  phases  of  the  subject.  No 
attempt  has  been  made  to  make  the  treatment  exhaustive.  We  have 
consciously  tried  to  avoid  controversial  issues.  The  student  who 
desires  a  comprehensive  treatment  of  the  subject  should  read  the 
following  treatises: 

Irving Fisher, The Making of Index Numbers, Houghton, Mifflin Company, 1927.

Wilford I. King, Index Numbers Elucidated, Longmans, Green and Company, 1930.

Wesley C. Mitchell, Index Numbers of Wholesale Prices in the United States and Foreign Countries, Bulletin Number 284 of the United States Bureau of Labor Statistics, 1921.

C. M. Walsh, The Problem of Estimation, King and Son, London, 1921.


EXERCISES 

1.  If $A$ is the simple arithmetic mean and $H$ is the simple harmonic mean of a group of relatives, show that their geometric mean, $\sqrt{AH}$, is an index that fulfills the time reversal test. Evaluate this index for eliminating type bias.

2.  If $A$ is the simple arithmetic mean and $H$ is the simple harmonic mean of a group of relatives, show that their arithmetic mean, $\dfrac{A + H}{2}$, and their harmonic mean, $\dfrac{2AH}{A + H}$, do not satisfy the time reversal test.

3.  Using $\sqrt{p_0 p_i}$ as the base prices, the simple arithmetic mean of relatives for the base year and for the given year are respectively given by

$$I_0 = \frac{1}{N} \sum \frac{p_0}{\sqrt{p_0 p_i}}, \qquad I_i = \frac{1}{N} \sum \frac{p_i}{\sqrt{p_0 p_i}}$$

Show that the index $I = I_i/I_0$ fulfills the time reversal test.

4.  Show that the simple geometric mean of relatives fulfills the time reversal test.

5.  Show that the simple arithmetic mean of relatives and the simple harmonic mean of relatives do not satisfy the time reversal test.

6.  Using  the  results  of  the  computations  of  Tables  36  and  37,  find 
Fisher's  Ideal  index  for  grain  prices  in  the  United  States  in  1925. 

7.  Using  the  results  of  Exercises  3  (1)  and  3  (2),  page  196,  find  Fisher's 
Ideal  index  for  the  prices  of  the  principal  farm  crops  in  the  United  States  in
1921.  Do  the  same  for  1929. 


8.  Table 41. Production and Wholesale Price of Mineral Products in the United States for the Years 1919, 1921, and 1923

Product           Unit         Production (millions)        Unit Price (dollars)
                               1919     1921     1923       1919     1921     1923

Pig Iron          long ton       30.6     16.6     40.0     28.97    22.58    26.29
Copper            lb.          1286.0    575.0   1667.0      0.19     0.13     0.15
Anthracite Coal   short ton      88.1     90.5     95.5      8.27    10.53    10.98
Bituminous Coal   short ton     465.9    415.9    564.2      2.34     2.19     2.27
Coke              short ton      44.2     25.3     55.5      4.58     3.45     5.34
Petroleum         bbl.          378.4    469.6    732.4      2.28     1.70     1.44

With  1919  as  the  base  year,  compute  indexes  for  the  years  1921  and  1923: 

(1)  by  formula  5,  Section  53 

(2)  by  formula  6,  Section  53 

(3)  by  the  formula  for  Fisher's  Ideal  using  the  results  of  the  two  pre- 
ceding indexes 

(4)  by  formula  10,  Section  53 

(5)  by  formula  11,  Section  53 

9.  (Davies  and  Crowder.)  Compute  the  weighted  aggregative  relative 
index  number  of  the  prices  of  farm  products  in  Iowa  in  1925  on  a  1910- 
1914  base. 


Commodity    Weights        Prices
             $q$            1910-1914    1925
                            $p_0$        $p_i$

Hogs          5.17 cwt.     $7.30        $11.08
Cattle        3.85 cwt.      6.39          8.43
Sheep         0.21 cwt.      4.51          7.48
Corn         24.98 bu.       0.53          0.86
Oats         19.12 bu.       0.35          0.39
Wheat         1.03 bu.       0.85          1.44
Hay           0.10 ton       9.82         11.23
Butter       40.62 lb.       0.25          0.41
Eggs         19.56 doz.      0.17          0.27
Poultry      14.58 lb.       0.10          0.18

Total
10.  Show  that  the  simple  aggregative  relative  fulfills  the  time  reversal
test. 

11.  Show  that  the  weighted  aggregative  relative  (weights  the  base 
period  quantities)  does  not  fulfill  the  time  reversal  test  nor  the  factor 
reversal  test. 

12.  Show  that  the  weighted  aggregative  relative  (weights  the  given 
period  quantities)  does  not  fulfill  the  time  reversal  test. 

13.  The  following  data  are  taken  from  the  Statistical  Abstract  of  the 
United  States,  1936,  p.  632. 

Compute  the  price  indexes  for  these  data  by  using  (1)  formula  5(a) 
and  (2)  formula  5(b). 


Grain        Price (cents)       Production                $p_0 q_0$    $p_0 q_i$    $p_i q_i$
                                 (millions of bushels)
             1930     1934       1930      1934
             $p_0$    $p_i$      $q_0$     $q_i$

Corn         59.6     81.5       2080      1478
Wheat        67.1     84.8        886       526
Oats         32.2     48.0       1275       542
Rye          44.5     71.8         45        17
Barley       40.5     68.6        300       117
Buckwheat    78.8     58.6          7         9
Rice         78.4     79.0         45        39

Total
56.  INTRODUCTION

The foregoing chapters have been devoted mainly to the problem of securing a brief numerical description of the simple frequency distribution. We have been enabled to describe the characteristic properties of a distribution (the central tendency, the variability, the skewness, and the excess) by means of a few statistical constants. More briefly, we may say that we have been able to compress the relevant information into four measures:

$$M, \qquad \sigma, \qquad \alpha_3, \qquad \alpha_4 - 3,$$

that are essentially the first four moments of the distribution. Additional information could be secured by fitting an appropriate frequency function to the observed data. Inasmuch as the general problem of describing frequency distributions by means of equations is beyond the scope of this text, no such refinements will be generally attempted.¹

Certain  types  of  data,  however,  admit  descriptions  by  means  of 
simple  equations,  and  it  is  to  them  that  we  now  turn  our  attention. 

It  should  be  kept  in  mind  that  our  problem  here  is  inverse  to  a 
kindred  problem  in  elementary  algebra.  There  we  were  given  the 
equation  that  expressed  the  relationship  between  X  and  Y.  We 
found sets of values of X and Y, plotted them, and drew the graph
which  was  a  pictorial  representation  of  the  relationship.  Here,  we 
have  the  pairs  of  values  that  have  come  from  observation;  we  plot 
them.  They  seem  to  lie  upon  or  nearly  upon  a  regular  curve;  that  is, 
there  is  an  apparent  mathematical  relationship  between  the  variables. 
What  is  the  equation  that  expresses  exactly  or  approximately  this 
relationship?  Is  there  a  summarizing  constant  that  can  be  used  to 
measure  the  degree  of  this  relationship? 

^  Distributions  that  may  be  appropriately  represented  by  the  normal  curve  are 
considered  in  Chapter  12. 
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In  this  chapter  we  shall  be  concerned  with  data  that,  we  assume, 
obey  the  simplest  mathematical  law,  the  linear  or  straight-line  law. 
Before  we  proceed  to  the  real  problem  of  the  chapter,  it  is  advisable 
that  we  devote  some  attention  to  some  of  the  analytical  properties 
of  a  straight  line. 

57.  SOME  CHARACTERISTIC  PROPERTIES 
OF  A  STRAIGHT  LINE 

If two values of a variable, X, are given, we denote their difference
by $\Delta X$ (read: delta ex). This does not mean $\Delta$ multiplied
by X. It is merely a short way of writing "the difference between the
two values of X." Thus, if the values of X are 5 and 9, then:

$\Delta X = 9 - 5 = 4$

In general, if $X_1$ and $X_2$ are two values of X:

$\Delta X = X_2 - X_1$

(Unless otherwise specified, a difference designated by $\Delta$ will be
taken in the order second minus first.) Similarly, $\Delta Y$ means "the
difference between the two values of Y." Thus, if the values of Y
are -2 and 4:

$\Delta Y = 4 - (-2) = 6$

Consider  the  line  AB  of  Figure  20  with  the  two  points  A  (2,  3)  and 
B  (5,  7)  upon  it. 

Figure  20 


$\Delta X = 5 - 2 = 3 = AC = MN$
$\Delta Y = 7 - 3 = 4 = CB$
For the more general case, we have the two points $P_1(X_1, Y_1)$
and $P_2(X_2, Y_2)$. Here:

$\Delta X = X_2 - X_1 = P_1C = MN$, and
$\Delta Y = Y_2 - Y_1 = CP_2$

For any two points on a straight line, the ratio $\Delta Y / \Delta X$
gives the slope of the line (see Figure 21). It is usually designated
by m. Hence:

$m = \text{slope of } P_1P_2 = \frac{Y_2 - Y_1}{X_2 - X_1} = \frac{Y_1 - Y_2}{X_1 - X_2}$   (1)

Figure 21

Thus the slope of a line between two points is equal to the difference
of the Y-coordinates of the points divided by the difference of their
X-coordinates, subtracted in the same order. It also means the change
in Y due to a unit change in X.
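
Formula (1) can be tried out numerically. The following is only a minimal sketch in Python; the function name `slope` and the use of exact fractions are choices made for this illustration and are not part of the text.

```python
from fractions import Fraction

def slope(p1, p2):
    """Slope of the line through p1 and p2, following formula (1)."""
    (x1, y1), (x2, y2) = p1, p2
    delta_x = x2 - x1  # differences taken in the order second minus first
    delta_y = y2 - y1
    if delta_x == 0:
        raise ValueError("vertical line: the slope is undefined")
    return Fraction(delta_y, delta_x)

# The points A(2, 3) and B(5, 7) of Figure 20.
print(slope((2, 3), (5, 7)))  # 4/3
# Reversing the order of the points leaves the ratio unchanged.
print(slope((5, 7), (2, 3)))  # 4/3
```

Exact fractions are used so that slopes such as 4/3 are not rounded; integer coordinates are assumed.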


EXERCISES 

Draw  the  lines  determined  by  the  following  pairs  of  points,  and  find  their 
slopes : 

1. (3, 2) and (5, 7)

2. (-2, -3) and (3, 2)

3. (3, 2) and (5, -7)

4. (-2, 3) and (5, -7)

5. (-2, 3) and (2, 3)

6. (3, -4) and (-2, -4)

7. Construct a line through (0, 0) with the slope equal to 2.

8. Construct a line through (0, 3) with the slope equal to 2.

9. Assuming that Y = 3X + 4 is the equation of a straight line, find
its slope. (Hint: Find two points upon the line.)

10. Assuming that Y = -2X + 4 is the equation of a straight line,
find its slope.

11.  Prove  by  means  of  slopes  that  (1,  -  3),  (2,  3),  and  (3,  9)  lie  on  the 
same  straight  line. 
It was probably observed in the exercises on page 205 that the
slope of a line may be positive, negative, or zero. If the line rises as
we proceed from left to right, $\Delta Y$ and $\Delta X$ have the same
sign, and the slope is positive. If the line falls as we proceed from
left to right, $\Delta Y$ and $\Delta X$ have opposite signs, and the
slope is negative. If the line is horizontal as we proceed from left to
right, $Y_2 = Y_1$, and hence the slope equals zero.

Figure 22

Thus in the figure we have three lines through the point R. The slope
of AB is positive; the slope of CD is negative; the slope of EF is zero.

In solving Exercise 11 on page 205, the student probably assumed
that if two segments $P_1P_2$ and $P_2P_3$ have a point $P_2$ in common,
and the same slope, the three points $P_1$, $P_2$, and $P_3$ are in the
same straight line. This theorem and its converse are characteristic
properties of a straight line.


58.   THE  EQUATION  OF  A  STRAIGHT  LINE 

In  elementary  algebra  the  student  has  drawn  graphs  of  certain 
given  equations.  Our  problem  now  is  to  find  the  equation  when  the 
graph  is  given.  That  is,  we  must  express  in  some  algebraic  way  the 
relation  between  X  and  Y  of  any  point  on  the  line. 

For example, if a point is anywhere on the X-axis, the Y-coordinate
is always zero. We express this simply by the equation:

$Y = 0$

This equation is therefore the equation of the X-axis.
Similarly, the equation of the Y-axis is:

$X = 0$

What is the equation of a line parallel to the X-axis and two units
above it?

What is the equation of a line parallel to the Y-axis and two units
to the right of it?

Again, if a line bisects the first and third quadrants, evidently

$Y = X$

for any point P on the line. That is, Y = X is the equation of the
line which bisects the first and third quadrants (see Figure 23).

Figure 23

Figure 24

Let us now find the equation of the line given in Exercise 8 on
page 205. We have B = (0, 3), and m = 2.

Let P(X, Y) be any point on the line (see Figure 24).
By definition:

the slope = $\frac{Y - 3}{X - 0} = 2$

or

$Y = 2X + 3$

Note that if the equation is solved for Y, the slope is the coefficient
of X. The distance OB cut off on the Y-axis is called the Y-intercept.
The Y-intercept in the equation above is the constant term, 3.

What is the X-intercept, OA?

We shall now turn to the problem of finding the equation of the line
through (0, b) with the slope equal to m, that is, the line whose
Y-intercept is b and whose slope is m.

Figure 25

Let P(X, Y) be any point on the line.

By definition:

the slope = $\frac{Y - b}{X - 0} = m$

or

$Y = mX + b$   (2)

Equation (2) is known as the slope-intercept equation of the straight
line.

If the fixed point is not on the Y-axis the equation takes a different
form. Suppose we wish to find the equation of the line through the
point (2, 1) with the slope equal to 3 (see Figure 26).
We let P(X, Y) be any point on the line.
Figure 26

By definition:

the slope = $\frac{Y - 1}{X - 2} = 3$

or

$Y - 1 = 3(X - 2)$

and finally

$Y = 3X - 5$

What is the X-intercept of the line? the Y-intercept?

In general, let $P_1(X_1, Y_1)$ be the fixed point, and m the given
slope.

Figure 27

As before, let P(X, Y) be any other point on the line. Then, by
definition:

the slope = $\frac{Y - Y_1}{X - X_1} = m$

or

$Y - Y_1 = m(X - X_1)$   (3)

The equation (3) is called the point-slope equation of the straight
line. Of course (2) is a special case of (3).

The point-slope form is very useful in finding the equation of a line
when two points on the line are given. We can determine the slope by
equation (1), then we may use equation (3) with either of the given
points as the point $P_1(X_1, Y_1)$.

Figure 28

Thus, let us find the equation of the line through the points (2, 1)
and (6, 4).

Here we have:

the slope = $m = \frac{3}{4}$

Now using equation (3), we have either:

a. $Y - 1 = \frac{3}{4}(X - 2)$
or b. $Y - 4 = \frac{3}{4}(X - 6)$

In either case, we obtain:

$3X - 4Y = 2$

What are the X- and Y-intercepts of this line?
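
The two-point procedure (slope from equation (1), then equation (3)) can be sketched in a few lines of Python. This is an illustration only; the helper name `line_through_points` is hypothetical, and the result is returned in the slope-intercept form (2).

```python
from fractions import Fraction

def line_through_points(p1, p2):
    """Return (m, b) of Y = mX + b for the line through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: no slope-intercept form")
    m = Fraction(y2 - y1, x2 - x1)  # equation (1)
    b = y1 - m * x1                 # from Y - Y1 = m(X - X1) with X = 0
    return m, b

# The worked example: the line through (2, 1) and (6, 4).
m, b = line_through_points((2, 1), (6, 4))
print(m, b)  # 3/4 -1/2

# Either given point satisfies 3X - 4Y = 2, the form obtained in the text.
for x, y in [(2, 1), (6, 4)]:
    print(3 * x - 4 * y)  # 2
```

Using either given point as the fixed point gives the same (m, b), which mirrors the remark that forms a. and b. reduce to the same equation.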
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EXERCISES 

1. Construct the line through (0, 2) with m = 3, and find its equation.

2. Construct the line through (0, -2) with m = 3, and find its equation.

3. Construct the line through (0, 2) with m = -3, and find its equation.

4. Construct the line through (0, -2) with m = -3, and find its
equation.

5. Determine the type form of each of the following equations. Name
the two conditions given. Use that knowledge in drawing the graph.

a. Y = 3X - 4

b. Y - 3 = 2(X - 5)

c. Y = 2X

d. Y = -|X

e. Y - 2 = 3(X + 4)

f. Y = X + 5

6.  State  the  equations  of  the  straight  lines: 

a.  Through  (2,  3)  with  slope  5 

b.  Through  (0,  5)  with  slope  f 

c.  Through  (6,  2)  with  slope  -  1 

7. Show that $AX + BY + C = 0$, $(B \neq 0)$, is the equation of a
straight line.

8. A straight line passes through the points (3, 5) and (8, 12). Find its
equation and its X- and Y-intercepts.

9. Prove that if two nonvertical lines are parallel, their slopes are
equal. State and prove the converse.

10. Prove that if two lines are perpendicular to each other, the product
of their slopes is -1. State and prove the converse.

Figure 29

Let the two lines intersect at C. Lay off $CA_1 = CA_2$, and draw the
parallels to the axes as shown in the figure. Then:

angle $a_1$ = angle $a_2$

(the sides being perpendicular each to each).

Hence the triangles $CA_1B_1$ and $CA_2B_2$ are congruent (why?) and

$CB_1 = B_2A_2$
$CB_2 = -B_1A_1$

(For $CB_1$ is positive and $B_1A_1$ is negative.)

The slope of line (1) is $\frac{B_1A_1}{CB_1}$ and the slope of line (2)
is $\frac{B_2A_2}{CB_2}$.

Therefore:

$m_1 m_2 = \left(\frac{B_1A_1}{CB_1}\right)\left(\frac{B_2A_2}{CB_2}\right) = \left(\frac{B_1A_1}{CB_1}\right)\left(\frac{CB_1}{-B_1A_1}\right) = -1$

The  proof  of  the  converse  is  left  as  an  exercise  for  the  student. 

11. Show that Y = 2X - 2, Y = 2X, and Y = 2X + 4 are parallel
lines.

12. Show that the lines 3X + 2Y = 6 and -2X + 3Y = 6 are
perpendicular.

13. Find the equation of a line through (2, 5) parallel to Y = 3X + 2.

14. Find the equation of a line through (2, 3) and perpendicular to
3X - 4Y = 8.

15.  Are  the  points  (2,  7),  (5,  13),  (9,  21),  (15,  33)  on  a  straight  line? 

16.  Are  the  points  (1,  5),  (3,  10),  (5,  13),  (7,  16)  on  a  straight  line? 
A. The Method of Least Squares. Many observed data, when
plotted, give a set of points that seem to lie somewhat closely upon
a curve. (As an illustration, see the data of automobile fatalities
on page 103.) This suggests to us that the data may approximately
follow some simple mathematical law. It is not necessary that any of
the points lie upon the curve selected to describe the data, but they
will likely be distributed above and below the curve as the figure
indicates.

Figure 30

Let $P_1$, $P_2$, $P_3$, $P_4$, etc. be several points determined by the
data. The curve indicating their general trend is called an empirical
curve. The difference between the ordinate of a given point and the
ordinate of the corresponding point on an empirical curve is called
the Y-residual of that point. That is:

$\rho_n$ (read: rho enn) = any Y-residual
    = [ordinate of given point] - [ordinate of corresponding point of curve]
Thus the Y-residuals of the points $P_1$, $P_2$, $P_3$, $P_4$ are
respectively $P_1Q_1$, $P_2Q_2$, $P_3Q_3$, $P_4Q_4$.

Let us consider now the following data, which were derived in an
experiment in a physics laboratory in connection with the problem
of finding the relation between the resistance in ohms of a certain
coil of wire and its temperature, the temperature to be kept between
10° and 100° C.

Figure 31

Table 42

  t        R
 10.5    10.42
 29.5    10.94
 42.7    11.32
 60.0    11.80
 75.5    12.24
 91.1    12.67

When  these  points  are  plotted  with  t  as  the  independent  variable 
and  R  the  dependent  variable  they  lie  close  to  a  straight  line.  (It 
can  be  shown  by  the  method  of  the  preceding  section  that  the  points 
do  not  lie  upon  a  straight  line.)  Allowing  for  errors  of  observation, 
we  may  assume  that  the  law  connecting  resistance  and  temperature 
is  linear,  and  our  problem  now  is  to  determine  the  equation  of  the 
straight  line  which  will  best  fit  the  given  data. 

What is to be regarded as a best fit will depend upon the precise
way that we choose to define the term best. While there is no unique
answer to the question, we shall define the best straight line in
accordance with what is called the principle of least squares. For the
straight line we shall state as follows the principle of least squares:
The straight line best fitting a set of points is that one in which the
constants are so determined that they will make the sum of the squares
of the residuals a minimum.¹

Before we undertake to apply this principle to determine the
equation of a straight line best fitting a set of data, let us examine
some observed data to which several straight lines have been fitted
and, adopting the principle of least squares as a criterion for the
goodness of fit, note that we can determine which of several lines
is the best.

¹ For other methods of fitting a straight line, see Section 81.


Figure 32

Table 43

  X     Observed Y     Computed Y = 2X + 3
  1         4                  5
  2         8                  7
  3         9                  9
  4        10                 11
  5        14                 13


Consider the observed data given by the first two columns of
Table 43. The five points constituting the observed data are plotted
on Figure 32. On this set of axes is drawn the line Y = 2X + 3.
This line has the slope of 2, the Y-intercept of 3, and it passes through
the point (3, 9), one of the observed points. Two of the observed
points are above the line, two are below the line, and one is on the
line. Judging by the graph, the line is a reasonable fit. That is,
corresponding to the given values of X the computed values of Y,
5, 7, 9, 11, 13 are reasonably near the corresponding observed values
of Y, 4, 8, 9, 10, 14. Stated differently, for the given values of X,
the values of Y computed from Y = 2X + 3 are reasonably close
approximations to the observed values of Y.

Just how near are the observed points, as a group, to the line
Y = 2X + 3? Let us answer this question by applying the principle
of least squares to the residuals (see end of page 211). The results
are given in Table 44.

We note that, based upon the line Y = 2X + 3,

$\sum \rho = 0$ and $\sum \rho^2 = 4$.

Suppose that we now consider the line Y = 2.2X + 2.4 with the
observed values given in Table 45. If the student will plot the
observed points and the line Y = 2.2X + 2.4 on the same axes, he
will observe that this line also passes among the points and that two
of the observed points are above the line, two are below the line, and
the line Y = 2.2X + 2.4 passes through the observed point (3, 9).

Table 44

  X    Observed Y    Computed Y = 2X + 3    Y-Residuals $\rho$    (Y-Residuals)²
  1        4                 5                    -1                   1
  2        8                 7                    +1                   1
  3        9                 9                     0                   0
  4       10                11                    -1                   1
  5       14                13                    +1                   1

                                          0 = $\sum \rho$       4 = $\sum \rho^2$

Thus we may say that the line Y = 2.2X + 2.4 also fits the data
reasonably closely. Just how closely does this line fit the data? Again
we find the sum of the squares of the residuals by preparing Table 45.


Table 45

  X    Observed Y    Computed Y = 2.2X + 2.4    Y-Residuals    (Y-Residuals)²
  1        4                  4.6                  -0.6             0.36
  2        8                  6.8                  +1.2             1.44
  3        9                  9.0                   0.0             0.00
  4       10                 11.2                  -1.2             1.44
  5       14                 13.4                  +0.6             0.36

                                            0.0 = $\sum \rho$   3.60 = $\sum \rho^2$

If the algebraic sum of the residuals, $\sum \rho$, is adopted as a
criterion for the goodness of fit, of the two lines we have considered
one fits as well as the other, since for each line $\sum \rho = 0$.
However, if we adopt the sum of the squared Y-residuals,
$\sum \rho^2$, as the criterion, then the line Y = 2.2X + 2.4 fits more
closely than the line Y = 2X + 3. As a matter of fact we shall soon
have the student show that, based upon the principle of least squares,
the line Y = 2.2X + 2.4 is the best-fitting line to the observed data
of Table 44.
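
The comparison of the two lines can be checked mechanically. The sketch below, in Python, recomputes the totals of Tables 44 and 45; the function name `residual_sums` is an assumption of this illustration, and exact fractions are used so that 2.2 and 2.4 are not subject to rounding.

```python
from fractions import Fraction

# Observed data of Table 43.
X = [1, 2, 3, 4, 5]
Y = [4, 8, 9, 10, 14]

def residual_sums(m, b, xs, ys):
    """Return (sum of Y-residuals, sum of squared Y-residuals) for Y = mX + b."""
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    return sum(residuals), sum(r * r for r in residuals)

line_1 = residual_sums(Fraction(2), Fraction(3), X, Y)            # Y = 2X + 3
line_2 = residual_sums(Fraction("2.2"), Fraction("2.4"), X, Y)    # Y = 2.2X + 2.4

print(float(line_1[0]), float(line_1[1]))  # 0.0 4.0
print(float(line_2[0]), float(line_2[1]))  # 0.0 3.6
```

Both lines give a zero sum of residuals, so only the sum of squares distinguishes them, as the text observes.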

We  shall  now  proceed  to  the  main  problem  of  the  section:  adopting 
as  a  criterion  of  the  goodness  of  fit  the  principle  of  least  squares,  to 
find  the  values  of  m  and  b  in  order  that  the  line  Y  =  mX  +  b  may 
best  fit  a  swarm  of  points.   We  shall  approach  the  general  problem 
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by  considering  first  a  simple  set  of  data,  namely,  that  given  in 
Exercise  16  on  page  210.  We  have  the  four  points  as  shown  in 
Figure  33. 


These data evidently have a straight-line trend. Assume that they
can be represented by

$Y = mX + b$

for proper values of m and b. Corresponding to X = 1, the ordinate
of the line is $m \cdot 1 + b$; corresponding to X = 3, the ordinate of
the line is $m \cdot 3 + b$, and so on. Hence, from the definition:

The 1st Y-residual = $\rho_1 = 5 - (m + b) = 5 - m - b$
The 2nd Y-residual = $\rho_2 = 10 - (3m + b) = 10 - 3m - b$
The 3rd Y-residual = $\rho_3 = 13 - (5m + b) = 13 - 5m - b$
The 4th Y-residual = $\rho_4 = 16 - (7m + b) = 16 - 7m - b$

The values of m and b must be so chosen that the sum of the squares
of the Y-residuals is a minimum. The sum of the squares of the
Y-residuals is given by:

$\sum \rho^2 = (5 - m - b)^2 + (10 - 3m - b)^2 + (13 - 5m - b)^2 + (16 - 7m - b)^2$

This result may be written either as a quadratic in b or as a
quadratic in m. We have then:

a. $\sum \rho^2 = 4b^2 + (32m - 88)b + (84m^2 - 424m + 550)$

b. $\sum \rho^2 = 84m^2 + (32b - 424)m + (4b^2 - 88b + 550)$
Recalling the theorem of Section 26 (p. 83) to the effect that

$Y = aX^2 + bX + c$ is a minimum when $X = \frac{-b}{2a}$

we note that $\sum \rho^2$ in a. is a minimum when:

$b = \frac{-(32m - 88)}{8} = -4m + 11$

or $4m + b = 11$

and $\sum \rho^2$ in b. is a minimum when:

$m = \frac{-(32b - 424)}{168} = \frac{-4b + 53}{21}$

or $21m + 4b = 53$

These equations

$4m + b = 11$
$21m + 4b = 53$

are called normal equations. If they are solved simultaneously, we
obtain

$m = 1.8$
$b = 3.8$

and hence, by the method of least squares, the best-fitting straight
line is:

$Y = 1.8X + 3.8$

If we give to X in this equation the values 1, 3, 5, 7, we obtain the
corresponding computed or most probable values of Y. Thus:

If X = 1, Y = 1.8 + 3.8 = 5.6
If X = 3, Y = 5.4 + 3.8 = 9.2

Note that if X = 4, Y = 11. That is, the point $(M_x, M_y)$ is on the
line.
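
As a check on the arithmetic, the two normal equations above can be solved by determinants. The Python sketch below is illustrative only; `solve_2x2` is a hypothetical helper written for this example.

```python
from fractions import Fraction

def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*u + b1*v = c1 and a2*u + b2*v = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the equations have no unique solution")
    u = Fraction(c1 * b2 - c2 * b1, det)
    v = Fraction(a1 * c2 - a2 * c1, det)
    return u, v

# Normal equations of the example: 4m + b = 11 and 21m + 4b = 53.
m, b = solve_2x2(4, 1, 11, 21, 4, 53)
print(float(m), float(b))  # 1.8 3.8

# The fitted line passes through the point (4, 11).
print(m * 4 + b)  # 11
```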

In the following table

any Y-residual = Y observed - Y computed

Table 47. Observed and Computed Values of Y Compared
by Means of Their Y-Residuals

  X     Y Observed     Y Computed     Y-Residuals     (Y-Residuals)²
  1          5             5.6           -0.6              .36
  3         10             9.2            0.8              .64
  5         13            12.8            0.2              .04
  7         16            16.4           -0.4              .16

        $M_y = 11$                        0.0             1.20
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EXERCISE 

Find  the  equation  of  the  straight  line  that  best  fits  the  data  below.  Then 
find  the  computed  values  of  Y  and  compare  them  with  the  observed  values. 


Table 48

  X       Y
  1      1.1
  3      6.8
  5     12.6
  7     19.0

Let us now generalize the procedure by fitting the line

$Y = mX + b$

to the data that consist of n sets of values¹ which are given in the
table.

Figure 34

We assume that the points have a linear trend, as shown in
Figure 34. Our problem is to determine the values of m and b for the
best-fitting line.

Corresponding to $X = X_1$, the ordinate of the line is $mX_1 + b$;
corresponding to $X = X_2$, the ordinate of the line is $mX_2 + b$, etc.
Hence we have:

$\rho_1$ = 1st Y-residual = $Y_1 - (mX_1 + b) = (Y_1 - mX_1 - b)$
$\rho_2$ = 2nd Y-residual = $Y_2 - (mX_2 + b) = (Y_2 - mX_2 - b)$
. . .
$\rho_n$ = nth Y-residual = $Y_n - (mX_n + b) = (Y_n - mX_n - b)$

¹ The student should note especially that n is the number of pairs of
values of Y and X.

The values of m and b must be so determined that the sum of the
squares of the Y-residuals is a minimum. We will therefore square
each residual and find the sum. We have

$\rho_1^2 = Y_1^2 + m^2X_1^2 + b^2 - 2mX_1Y_1 - 2bY_1 + 2bmX_1$
$\rho_2^2 = Y_2^2 + m^2X_2^2 + b^2 - 2mX_2Y_2 - 2bY_2 + 2bmX_2$
. . .
$\rho_n^2 = Y_n^2 + m^2X_n^2 + b^2 - 2mX_nY_n - 2bY_n + 2bmX_n$

Adding the above equations, we express $\sum \rho^2$ as a quadratic in b
and as a quadratic in m. Using the sigma notation, after careful
rearrangement of terms, we obtain:

a. $\sum \rho^2 = nb^2 + 2[m\sum X - \sum Y]b + [m^2\sum X^2 - 2m\sum XY + \sum Y^2]$

b. $\sum \rho^2 = m^2\sum X^2 + 2[b\sum X - \sum XY]m + [\sum Y^2 - 2b\sum Y + nb^2]$

From equation a., $\sum \rho^2$ is a minimum when

$b = \frac{-2[m\sum X - \sum Y]}{2n} = \frac{-m\sum X + \sum Y}{n}$

or $m\sum X + nb = \sum Y$

and from equation b., $\sum \rho^2$ is a minimum when:

$m = \frac{-2[b\sum X - \sum XY]}{2\sum X^2} = \frac{-b\sum X + \sum XY}{\sum X^2}$

or $m\sum X^2 + b\sum X = \sum XY$

These equations

$m\sum X + nb = \sum Y$
$m\sum X^2 + b\sum X = \sum XY$   (4)
are the normal equations.¹ Note that the first can be written by
summing the equation Y = mX + b, and that the second can be
written by multiplying Y = mX + b by X and summing the result.
Solving the normal equations simultaneously, we have:

$m = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$

$b = \frac{\sum X^2 \sum Y - \sum X \sum XY}{n\sum X^2 - (\sum X)^2}$   (5)

and for the best-fitting straight line

$Y = \left(\frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}\right) X + \frac{\sum X^2 \sum Y - \sum X \sum XY}{n\sum X^2 - (\sum X)^2}$   (6)

The line given by (6) is sometimes called the line of regression² of
Y on X.

Let us use the following tabular arrangement to compute the
coefficients m and b in (5) and to compare the computed values of
Y with the observed values.

Table 50

    X       Observed Y     $X^2$     XY      Computed Y    Y-Residuals    (Y-Residuals)²
   (1)         (2)          (3)      (4)        (5)            (6)             (7)

 $\sum X$   $\sum Y$   $\sum X^2$  $\sum XY$                $\sum \rho$    $\sum \rho^2$

¹ The student familiar with the calculus would derive these equations
much more quickly. Thus, if

$\rho_i = Y_i - (mX_i + b)$ = the ith Y-residual,

then: $\sum \rho_i^2 = \sum (Y_i - mX_i - b)^2$

The values of m and b for which $\sum \rho^2$ is a minimum are obtained
by setting the partial derivatives of $\sum \rho^2$ with respect to m
and b each equal to zero. We then have:

$m\sum X + nb = \sum Y$
$m\sum X^2 + b\sum X = \sum XY$

² The line of regression of X on Y may be obtained by minimizing the
sum of the squares of the X-residuals of the line X = mY + b. The
properties of this line will be summarized in Section 66 (p. 248).
In connection with Table 50, the following procedure is recommended
for numerical problems:¹

1. Compute the values for columns (3) and (4).

2. Find $\sum X$, $\sum Y$, $\sum X^2$, and $\sum XY$.

3. Substitute in (4), solve for m and b, and obtain the equation of the
best-fitting straight line.

4. Substitute the values of X from column (1) into the equation of the
straight line and obtain the computed or most probable values of Y.
This completes column (5).

5. Complete columns (6) and (7) and thus find $\sum \rho$ and
$\sum \rho^2$.
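
The five steps can be sketched as a short Python routine. This is a minimal illustration under stated assumptions, not a prescribed implementation: the names `least_squares_line` and `residual_table` are hypothetical, the coefficients are taken from the closed forms (5) rather than by solving (4) by hand, and exact fractions are used so that decimal data are not rounded.

```python
from fractions import Fraction

def least_squares_line(xs, ys):
    """Return (m, b) for Y = mX + b from the sums used in Table 50."""
    xs = [Fraction(str(x)) for x in xs]
    ys = [Fraction(str(y)) for y in ys]
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)                 # step 2
    sum_xx = sum(x * x for x in xs)                 # column (3) total
    sum_xy = sum(x * y for x, y in zip(xs, ys))     # column (4) total
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all X values are equal; no unique line")
    m = (n * sum_xy - sum_x * sum_y) / denom        # equations (5)
    b = (sum_xx * sum_y - sum_x * sum_xy) / denom
    return m, b

def residual_table(xs, ys):
    """Rows of (X, Y, computed Y, residual, residual squared), steps 4 and 5."""
    m, b = least_squares_line(xs, ys)
    rows = []
    for x, y in zip(xs, ys):
        x, y = Fraction(str(x)), Fraction(str(y))
        computed = m * x + b
        rho = y - computed
        rows.append((x, y, computed, rho, rho ** 2))
    return m, b, rows

# The four points (1, 5), (3, 10), (5, 13), (7, 16) fitted earlier.
m, b, rows = residual_table([1, 3, 5, 7], [5, 10, 13, 16])
print(float(m), float(b))                  # 1.8 3.8
print(float(sum(r[3] for r in rows)))      # 0.0
print(float(sum(r[4] for r in rows)))      # 1.2
```

The same call with the X and Y columns of Table 48 would carry out the exercise that follows.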

EXERCISE 

Apply  the  foregoing  suggestions  to  the  data  of  Table  46  and  Table  48. 

B. The Method of Moments. Another very widely used method
for fitting a theoretical curve to observed data is the method of
moments.

Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be the n points
determined by n sets of observed data. If the selected curve is denoted
by $Y = f(X)$, the theoretical values of Y are
$f(X_1), f(X_2), \ldots, f(X_n)$. The principle of moments (see
Section 42, p. 159) says that we shall obtain a good fit if the
$\ell$th moment about OY, $(\ell = 0, 1, 2, \ldots, k - 1)$, of the n
observed values of Y equals the corresponding $\ell$th moment about OY
of the n theoretical values of Y, k being the number of undetermined
constants in the given equation. That is, for $\ell = 0$ we have the
zeroth moments:

$\sum$ observed Y = $\sum$ theoretical Y

or

$\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} f(X_i)$

For $\ell = 1$ we have the first moments:

$\sum_{i=1}^{n} X_i Y_i = \sum_{i=1}^{n} X_i f(X_i)$

For $\ell = 2$ we have the second moments:

$\sum_{i=1}^{n} X_i^2 Y_i = \sum_{i=1}^{n} X_i^2 f(X_i)$

and so on.

¹ Equations (5) and (6) are useful for theoretical problems whereas
equations (4) are better for numerical problems.
Figure 35

When the curve $Y = f(X)$ is a straight line $Y = mX + b$, we have

$\sum Y = \sum (mX + b) = m\sum X + nb$
and $\sum XY = \sum X(mX + b) = m\sum X^2 + b\sum X$

which are the same equations as (4).¹ Evidently the suggestions
following Table 50 apply here.
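
The agreement of the two methods for a straight line can be verified numerically. The Python sketch below assumes the line Y = 1.8X + 3.8 already fitted to the four points (1, 5), (3, 10), (5, 13), (7, 16); the helper name `moment` is hypothetical. Since the line has k = 2 undetermined constants, only the zeroth and first moments are compared.

```python
from fractions import Fraction

def moment(xs, values, order):
    """Moment of the given order about OY: the sum of X**order times the value."""
    return sum(Fraction(x) ** order * v for x, v in zip(xs, values))

X = [1, 3, 5, 7]
Y = [5, 10, 13, 16]
m, b = Fraction(9, 5), Fraction(19, 5)      # the least-squares line found earlier
theoretical = [m * x + b for x in X]

for order in (0, 1):
    print(order, moment(X, Y, order), moment(X, theoretical, order))
# 0 44 44
# 1 212 212
```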

EXERCISES 

1. Find the equation of the straight line which best fits the
temperature resistance data of Table 42 (p. 211).

2. The lengths, l, attained by a certain coiled spring made of steel
wire, corresponding to different weights, w, supported by the spring
were as shown in the following table. The lengths were measured in
centimeters and the weights in grams. Find the linear relation in the
form l = mw + b.

¹ It can be shown (see "The Method of Moments" by Dunham Jackson,
American Mathematical Monthly, September, 1923) that if f(X) is a
polynomial, the method of moments gives the same solution as the method
of least squares.
Lengths and Weights of Spring

   w        l
  100      92.2
  200      94.3
  300      96.2
  400      98.3
  500     100.2
  600     102.3

3.  Compute  the  length  of  the  spring  in  the  table  of  Exercise  2  for  all 
weights  at  intervals  of  50  grams  from  50  grams  to  650  grams. 

4. Show that the point $(M_x, M_y)$ is on the line (6), that is, show
that the best-fitting line passes through the centroid of the points.

5. Using the values of m and b given in equation (5), show that the sum
of the Y-residuals for Y = mX + b is equal to zero.


60. THE STRAIGHT LINE WITH THE ORIGIN
AT THE CENTROIDAL POINT

The theorem contained in Exercise 4 above states that the best-fitting
straight line passes through the centroidal point $(M_x, M_y)$. Using
this point as origin, the equation of the line takes a much simpler
form and our further mathematical treatment is greatly simplified.

Denote the centroidal point by C.

Figure 36

If X, Y is any pair of numbers referred to zero as origin, their values
referred to C as origin are given by:

$x = X - M_x$
$y = Y - M_y$   (7)

Figure 37

The equation of the line referred to the new origin, C, is of the form

$y = mx$

since the y-intercept is zero.

The tabulated data now take the following form:
Table 51

   X        Y       $x = X - M_x$    $y = Y - M_y$
 $X_1$    $Y_1$        $x_1$            $y_1$
  . .      . .          . .              . .
 $X_i$    $Y_i$        $x_i$            $y_i$
  . .      . .          . .              . .
 $X_n$    $Y_n$        $x_n$            $y_n$

Corresponding to $x = x_1$, the ordinate of the line is $mx_1$;
corresponding to $x = x_2$, the ordinate of the line is $mx_2$, etc.
Hence:

$\rho_1$ = 1st y-residual = $y_1 - mx_1$
$\rho_2$ = 2nd y-residual = $y_2 - mx_2$
. . .
$\rho_n$ = nth y-residual = $y_n - mx_n$

We wish to determine m so that the sum of the squares of the
y-residuals is a minimum. Evidently:

$\sum \rho_i^2 = \sum (y_i - mx_i)^2 = m^2 \sum x_i^2 - 2m \sum x_i y_i + \sum y_i^2$   (8)

The value of m for which $\sum \rho_i^2$ is a minimum is, omitting
subscripts:

$m = \frac{\sum xy}{\sum x^2}$   (9)

and the best-fitting line, referred to the axes through C, is:

$y = \left(\frac{\sum xy}{\sum x^2}\right) x$   (10)

If we replace x and y by their values in (7), we obtain the equation
of the line referred to axes through O(0, 0) as origin, namely:

$Y - M_y = \frac{\sum xy}{\sum x^2}(X - M_x)$   (11)

We  shall  illustrate  the  procedure  to  be  followed  should  one  decide 
to  fit  a  least-squares  line  by  the  x,  y  method.  We  shall  use  data 
that  we  have  previously  considered  by  the  X,  Y  method.  After 
computing  Mx  =  4  and  My  =  11,  we  complete  the  table  (see 
Table  52)  by  finding  the  quantities  suggested  by  (9).  The  algebraic 
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work  follows  the  table,  and  we  obtain  of  course  the  same  equation 
that  we  found  on  page  216. 

Table 52

     X           Y        x = X - 4    y = Y - 11     xy     $x^2$
     1           5           -3           -6          18       9
     3          10           -1           -1           1       1
     5          13            1            2           2       1
     7          16            3            5          15       9

 $M_x = 4$   $M_y = 11$                               36      20

$m = \frac{36}{20} = \frac{9}{5} = 1.8$

or

y = 1.8x               (Equation of line through C as origin)
Y - 11 = 1.8(X - 4)
Y = 1.8X + 3.8         (Equation of line referred to axes
                        through O as origin)
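
The x, y computation of Table 52 can be sketched in Python as follows. This is only an illustration of formulas (7), (9), and (11); the function name `centroid_fit` is hypothetical, and integer data are assumed so that the means can be formed as exact fractions.

```python
from fractions import Fraction

def centroid_fit(xs, ys):
    """Fit y = mx with the origin moved to the centroid (Mx, My)."""
    n = len(xs)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    dev_x = [x - mean_x for x in xs]           # x = X - Mx, equation (7)
    dev_y = [y - mean_y for y in ys]           # y = Y - My
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_xx = sum(a * a for a in dev_x)
    if sum_xx == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    m = sum_xy / sum_xx                        # equation (9)
    b = mean_y - m * mean_x                    # intercept implied by equation (11)
    return m, b, sum_xy, sum_xx

m, b, sum_xy, sum_xx = centroid_fit([1, 3, 5, 7], [5, 10, 13, 16])
print(sum_xy, sum_xx)        # 36 20
print(float(m), float(b))    # 1.8 3.8
```

Only one constant, m, has to be determined in this form; the intercept follows from the requirement that the line pass through the centroid.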


We have thus developed two methods of finding the equation of the least-squares line determined by a set of data. We may determine m and b for the line Y = mX + b by using the normal equations (4) with the X, Y data, or we may determine m for the line y = mx, where x and y are the deviations of X and Y from their respective means: $x = X - M_X$, $y = Y - M_Y$; then replacing x and y by their values we obtain the X, Y equation.

The second method is preferred for numerical problems when the values of x and y are such that the arithmetical operations upon them are simpler than when X and Y are used. Thus, if the X, Y data are integral and $M_X$ and $M_Y$ are integral, the values of x and y will be integral and then the table for finding m is decidedly simple to construct. If $M_X$ and $M_Y$ are decimals and the values of x and y are decimals, the second method is to be discouraged.

However, for theoretical purposes the results of the second, or x, y, method are important and the contents of Section 60 should be mastered.

Let us consider the data of Table 53. We wish to find the X, Y equation for these data.

Table 53. The Index Numbers of Retail Prices of 10 Articles of Food at Two Different Years

  Article   1st year X   2nd year Y   X^2      XY       x     y     x^2   xy
  1         88           82           7,744    7,216    4     5     16    20
  2         77           71           5,929    5,467    -7    -6    49    42
  3         91           82           8,281    7,462    7     5     49    35
  4         75           70           5,625    5,250    -9    -7    81    63
  5         95           87           9,025    8,265    11    10    121   110
  6         83           77           6,889    6,391    -1    0     1     0
  7         85           77           7,225    6,545    1     0     1     0
  8         82           77           6,724    6,314    -2    0     4     0
  9         84           73           7,056    6,132    0     -4    0     0
  10        80           74           6,400    5,920    -4    -3    16    12
  Total     840          770          70,898   64,962   0     0     338   282
            M_X = 84     M_Y = 77

Using the X, Y values with

\[ Y = mX + b \]

the normal equations are

\[
\begin{aligned}
840m + 10b &= 770 \\
70{,}898m + 840b &= 64{,}962
\end{aligned}
\]

Eliminating b we obtain

\[
\begin{aligned}
70{,}560m + 840b &= 64{,}680 \\
70{,}898m + 840b &= 64{,}962 \\
338m &= 282 \\
m &= 0.83
\end{aligned}
\]

Substituting we find

\[ b = 7.28 \]

and our X, Y equation

\[ Y = 0.83X + 7.28 \]

Using the x, y values with

\[ y = mx \]

the normal equation is

\[ m = \frac{\Sigma xy}{\Sigma x^2} = \frac{282}{338} \]

\[ m = 0.83 \]

Our x, y equation is

\[ y = 0.83x \]

and our X, Y equation is

\[ Y - 77 = 0.83(X - 84) \]

or, simplifying,

\[ Y = 0.83X + 7.28 \]

Obviously the x, y method leads to the solution more simply than the X, Y method.
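
The two routes just compared can also be checked numerically. The following is a minimal sketch in Python (not part of the original text); the function names are illustrative only, and the data are the index numbers of Table 53.

```python
# Index numbers from Table 53: X = 1st year, Y = 2nd year.
X = [88, 77, 91, 75, 95, 83, 85, 82, 84, 80]
Y = [82, 71, 82, 70, 87, 77, 77, 77, 73, 74]


def fit_by_normal_equations(X, Y):
    """X, Y method: solve the two normal equations for m and b."""
    n = len(X)
    sum_x, sum_y = sum(X), sum(Y)
    sum_xx = sum(a * a for a in X)
    sum_xy = sum(a * b for a, b in zip(X, Y))
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


def fit_by_deviations(X, Y):
    """x, y method: m = sum(xy) / sum(x^2) on deviations from the means."""
    n = len(X)
    mean_x, mean_y = sum(X) / n, sum(Y) / n
    x = [a - mean_x for a in X]
    y = [b - mean_y for b in Y]
    m = sum(p * q for p, q in zip(x, y)) / sum(p * p for p in x)
    b = mean_y - m * mean_x  # move the origin back to O(0, 0)
    return m, b


m_normal, b_normal = fit_by_normal_equations(X, Y)
m_dev, b_dev = fit_by_deviations(X, Y)
print(round(m_normal, 4), round(b_normal, 4))
print(round(m_dev, 4), round(b_dev, 4))
```

Both functions give the same slope, 282/338, which is 0.83 to two places. Note that the intercept 7.28 in the text results from substituting the rounded slope 0.83; if m is carried to more decimal places before substituting, the intercept comes out nearer 6.92.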

The student who has been impressed with the power of the x' method when computing $M$, $\sigma$, $\alpha_3$, and $\alpha_4$ will naturally wonder if this method cannot be employed to advantage in this work of fitting straight lines to data. We assure him that the method is an excellent one and we shall present it in the next chapter.
EXERCISES

1. Find the X, Y equation of the least-squares line for each of the following sets of data:

       (a)              (b)              (c)
   X      Y         X      Y         X      Y
   1      2         5      12        10     4
   8      4         7      15        8      5
   14     8         11     26        6      7
   15     9         13     33        4      8
   22     12        14     34        2      11

       (d)
   X      Y
   2      47
   4      43
   6      41
   10     37
   13     31
   15     26
   20     20

2. In the following table S is the weight of potassium bromide which will dissolve in 100 grams of water at T° C. Find the relation: S = mT + b. Use this equation to estimate S when T = 50°.

   T    0     20    40    60    80
   S    54    65    75    85    96


3. In the following table

X = scores of ten students on a standardized test in secondary algebra taken at the beginning of college
Y = semester grades of the same students in college algebra

Find by two methods the least-squares line for these data. Based upon these data, estimate the semester grade of a student who made 60 on the standardized test.

   X      Y         X      Y
   54     67        90     91
   56     68        63     74
   64     74        47     52
   33     48        92     90
   57     69        34     47

4. A biologist found that the length (in centimeters) of intestines of birds and their weight (in grams) were linearly related. Find the relation W = mL + b for the data given in the following table.

   Average Length     Average Weight     Average Length     Average Weight
   of Intestines L    of Bird W          of Intestines L    of Bird W
   4.3                1.5                9.7                6.5
   5.8                2.7                10.2               7.3
   6.5                3.6                11.0               8.1
   7.3                4.2                11.6               8.8
   8.4                5.4                12.4               9.7
   9.0                5.9                12.6               9.8

5. The latent heat, L, of steam in calories is given for various values of the temperature, T. Find the equation of the best-fitting line for L in terms of T.

   T      L         T      L
   70     556       110    530
   80     550       120    523
   90     542       130    515
   100    536

What is the value of L when T = 75?

Compare the computed and the observed values of L for the given values of T.

61.   FITTING  A  STRAIGHT  LINE  TO  A  TIME  SERIES 

In Section 17 (p. 43) we encountered series in which time is the independent variable. Several time series were tabulated in Tables 10, 11, 12 (pp. 44-46), and their graphical representations were exhibited in Charts 4, 5, and 6. Further attention to time series has been reserved for this chapter because, after the graphical representation, the next step in the analysis is the determination of the long-time trend, frequently called the secular trend, and this is usually accomplished by fitting a straight line to the data. The straight line, of course, should be fitted only to those series which, over a long period, show a general movement in one direction, that is, a general tendency to increase or to decrease.

Over a considerable period of time, many social and economic phenomena show a definite tendency to grow or to decline, that is, they show a definite trend. For example, the population of the United States (see Table 10) shows a definite tendency to increase, while the percentage decade rate of growth is constantly declining. The production of lumber in the United States (see Table 11) from 1909 to 1922 shows a definite tendency to decline. While the secular trend is usually described by means of a linear function, it must not be supposed that all definite trends are so simply described. Population data, for example, frequently require curves with rather complex equations for their description.

The fact should be emphasized that the secular trend is concerned with the regular, long-term movements. True, over a short period of time the movements may vary spasmodically, but the general trend is upward or downward. We are not concerned here with the seasonal variations that are so characteristic of time series, but with the secular trends, and only those that can be described by linear functions.

The computed or trend value of Y at any date is taken as the normal value at that date. It is viewed as the value that would obtain if all temporary and accidental forces were eliminated. The equation of the trend line from which trend values are computed is merely a summarizing expression for the large group of data upon which it is based, and therefore may be used for making estimates of values within the period but may not be at all applicable for making forecasts and predictions. Some new factor may enter at any time and disturb the trend. Therefore, when a trend line is used for extrapolation (that is, for computing values outside the given abscissal range) the implications of the line beyond the period of record should be carefully checked against every possible evidence that may influence the factor in question.

Table 54. The Production of Lumber in the United States: Computing the Secular Trend (1)

  Year    X     Production in Board    X^2    XY        Computed Y    p
                Feet (billions) Y
  1909    -6    44.5                   36     -267.0    42.1          2.4
  1910    -5    40.0                   25     -200.0    41.2          -1.2
  1911    -4    37.0                   16     -148.0    40.3          -3.3
  1912    -3    39.2                   9      -117.6    39.4          -0.2
  1913    -2    38.4                   4      -76.8     38.5          -0.1
  1914    -1    37.3                   1      -37.3     37.6          -0.3
  1915    0     37.0                   0      0.0       36.7          0.3
  1916    1     39.9                   1      39.9      35.8          4.1
  1917    2     35.8                   4      71.6      34.9          0.9
  1918    3     31.9                   9      95.7      34.0          -2.1
  1919    4     34.6                   16     138.4     33.1          1.5
  1920    5     33.8                   25     169.0     32.2          1.6
  1921    6     27.0                   36     162.0     31.3          -4.3
  1922    7     31.6                   49     221.2     30.5          1.1
  Total   7     508.0                  231    51.1                    0.4

(1) The data are taken from Statistical Abstract of the United States, 1928, p. 689.

The  method  of  fitting  a  straight  line  to  a  time  series  is  illustrated  in 
Table  54.  Our  problem  here  is  to  find  the  equation  of  the  trend  line 
for  the  production  of  lumber  in  the  United  States,  the  data  for  which 
were  given  in  Table  11,  and  graphically  presented  in  Chart  5. 

While  the  origin  for  X  may  be  chosen  at  any  point,  for  the  sake 
of  simple  computation  it  should  be  taken  at  or  near  the  center.  If 
an  odd  number  of  years  is  under  consideration,  it  should  be  taken  at 
the  middle  year  of  the  period.   If  X  =  0  at  1915,  we  have 

\[ n = 14, \qquad \Sigma X = 7, \qquad \Sigma Y = 508, \qquad \Sigma X^2 = 231, \qquad \Sigma XY = 51.1 \]

Using formulas (5), we find m and b:

\[ m = \frac{14(51.1) - 7(508)}{14(231) - 49} = -0.892 \]

\[ b = \frac{231(508) - 7(51.1)}{14(231) - 49} = 36.73 \]

The equation of the straight line which gives the secular trend is therefore

\[ Y = -0.892X + 36.73 \]

from which the computed values and the residuals can be found.
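
The arithmetic of this trend fit can be reproduced mechanically. The sketch below (Python, not part of the original text) uses the production figures of Table 54 and the two expressions for m and b exactly as they appear in the worked computation above; the variable names are illustrative.

```python
# Lumber production (billions of board feet), 1909-1922, from Table 54.
years = list(range(1909, 1923))
Y = [44.5, 40.0, 37.0, 39.2, 38.4, 37.3, 37.0,
     39.9, 35.8, 31.9, 34.6, 33.8, 27.0, 31.6]

# Take the origin X = 0 at 1915, as in the text.
X = [year - 1915 for year in years]

n = len(X)
sum_x = sum(X)
sum_y = sum(Y)
sum_xx = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Slope and intercept in the form used in the worked computation.
denominator = n * sum_xx - sum_x ** 2
m = (n * sum_xy - sum_x * sum_y) / denominator
b = (sum_xx * sum_y - sum_x * sum_xy) / denominator

# Trend (computed) values and residuals p = observed Y - computed Y.
trend = [m * x + b for x in X]
residuals = [y - t for y, t in zip(Y, trend)]

print(f"Y = {m:.3f} X + {b:.2f}")
for year, y, t, p in zip(years, Y, trend, residuals):
    print(year, y, round(t, 1), round(p, 1))
```

Because the intercept is fitted along with the slope, the residuals sum to zero apart from rounding; the total 0.4 in the last column of Table 54 reflects the one-decimal rounding of the computed values.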

Other methods for treating time series will be found in Sections 81 and 87. While the methods we shall present in these later sections make possible the determination of the constants m and b with less arithmetical tedium, we shall present no method that surpasses in precision and reliability that based upon the principle of least squares. In addition to the three important properties (Can you name them?) to which we have referred, casually perhaps, the least-squares line has the enthusiastic approval of the theory of probability. We cannot say so much for any other line.
EXERCISES 

1. The following table gives the annual production of Portland Cement in the United States. [Statistical Abstract of the United States, 1930, p. 785.] Find the least-squares line for these data.

   Year    Production               Year    Production
           (millions of barrels)            (millions of barrels)
   1910    77                       1920    100
   1911    79                       1921    99
   1912    82                       1922    115
   1913    92                       1923    137
   1914    88                       1924    149
   1915    86                       1925    161
   1916    92                       1926    165
   1917    93                       1927    173
   1918    71                       1928    176
   1919    81                       1929    171

2. In the following table Y gives the average weekly earnings of shop and office employees in representative New York State factories.

   Year    X     Y          Year    X    Y
   1914    -3    $12.48     1918    1    $20.35
   1915    -2    12.85      1919    2    23.50
   1916    -1    14.43      1920    3    28.15
   1917    0     16.37      1921    4    ...

(1) Find the equation of the least-squares line for these data.

(2) Based upon this line what were the predicted average weekly earnings in 1921? The actual average weekly earnings were $25.72.

3. In the following table Y gives (in millions of dollars) the net earnings of the Associated Gas and Electric System, 1920-1928.

   Year    X     Y        Year    X    Y
   1920    -4    13.4     1925    1    29.5
   1921    -3    16.2     1926    2    33.5
   1922    -2    19.2     1927    3    37.8
   1923    -1    22.7     1928    4    40.6
   1924    0     25.1     1929    5    ...

(1) Find the equation of the least-squares line for these data.

(2) Based upon this line, what were the predicted earnings for 1929? The actual net earnings were 48.5 millions.

4. The number of mules on farms in the United States for the given years is shown in the following table. Choose X = 0 at 1932 and find the least-squares line for the data.

   Year    Mules         Year    Mules
           (millions)            (millions)
   1926    5.9           1933    5.0
   1927    5.8           1934    4.9
   1928    5.7           1935    4.8
   1929    5.5           1936    4.7
   1930    5.4           1937    4.6
   1931    5.3           1938    4.4
   1932    5.1           1939    ...

REVIEW  EXERCISES 

1.  Use  the  relations  X  =  x  +  Mx,  Y  =  y  +  My  with  the  value  of 
m  given  by  (5)  page  218  and  thus  obtain  the  value  of  m  given  by  (9) 
page  222. 

2.  State  the  three  most  important  properties  of  the  least-squares  line 
fitting  a  swarm  of  points. 

3. Given a set of variates, what is the algebraical sum of the deviations of these variates from their $M_X$?

4. Given a set of variates, from what value is the sum of the squares of the deviations least?

5. Given a set of variates distributed normally, what per cent of the variates lie within the interval $M_X \pm \sigma_X$? within the interval $M_X \pm 2\sigma_X$? within the interval $M_X \pm 3\sigma_X$?

6. When is it advisable to use the method of Section 60 to find the equation of the least-squares line?

7. What is the unit of measurement of a Y-residual? of a (Y-residual)$^2$? of $\Sigma(Y\text{-residual})^2$ or $\Sigma p^2$?

8. Find $\Sigma p^2$ for the data on lumber production, Table 54, including the unit of measurement.

9. Do you think $\Sigma p^2 / n$ can be used to measure the goodness of fit of a curve fitted to a swarm of points? In what unit would it be expressed?

10. What about $\sqrt{\Sigma p^2 / n}$ as a measure of the goodness of fit? In what unit would it be expressed? Do you think $\sqrt{\Sigma p^2 / n}$ superior to $\Sigma p^2 / n$ as a measure of goodness of fit? Why?

11. (a) Show, for the data of lumber production, Table 54, that

\[ \sqrt{\frac{\Sigma p^2}{n}} = 2.15 \text{ billions of board feet.} \]

(b) How many values of p in Table 54 are numerically less than 2.15?

(c) How many values of p are numerically less than 2(2.15)?

12. Can you think of any method whereby we may compare the closeness of fit of straight lines fitted to data expressed in different units, say Tables 53 and 54?
SIMPLE CORRELATION

62. MEASURES OF CONCENTRATION OF POINTS ABOUT THE LINE OF REGRESSION

In the preceding chapter we have devoted considerable attention to the problem of securing a linear mathematical expression for the relationship between two variables. This equation, called the line of regression of Y on X, expresses mathematically the average relationship between the variables.* In the exercises and the illustrative examples that we have considered, the points have clustered closely about the regression line. But a line of definite equation may be fitted to points that are quite scattered, widely dispersed with respect to the line. A question immediately presents itself: How can we measure the closeness with which the points cluster about the line? Can we find a measure of the degree of the relationship between the two variables?

This problem is similar to that which arose in connection with the measures of central tendency. We desired to know how great was the concentration of the measures of a distribution about their mean. To measure this concentration we built up several measures of dispersion, recommending especially the standard deviation, which is the square root of the mean of the squares of the deviations from the arithmetic mean.

The line of regression possesses two important properties that are analogous to similar properties of the arithmetic mean. The arithmetic mean is the value such that the sum of deviations from it is zero (see Exercise 5, p. 68); the regression line enjoys the property that the sum of the residuals from it is zero (see Exercise 5 on p. 221). The arithmetic mean is the value such that the sum of the squares of the deviations from it is a minimum (see p. 131); the regression line enjoys the property that the sum of the squares of the residuals from it is a minimum (the principle of least squares).

* The line of regression of X on Y will be considered in Section 66 (p. 247).
Owing to the fact that the line of regression possesses the two properties mentioned before, it is frequently called the line of means. This name will be more adequately justified in Section 67 (p. 254). Consequently, just as we used

\[ \sigma_X = \sqrt{\frac{\Sigma x^2}{N}} \]

(where N is the total frequency) to measure the concentration of the observed X measures about their mean, so we use

\[ S_Y = \sqrt{\frac{\Sigma p^2}{n}} \tag{1} \]

(where n is the number of pairs of values of X and Y and where $p_i$ = observed $Y_i$ - computed $Y_i$) to measure the concentration of the observed Y measures about their line of means. $S_Y$ is called the standard error of estimate. One method of obtaining $S_Y$ is illustrated in Table 47 (p. 215) and Table 50 (p. 218).
In Table 47 we have:

\[ \Sigma p^2 = 1.20 \qquad \text{and} \qquad n = 4 \]

therefore

\[ S_Y = \sqrt{\frac{1.20}{4}} = 0.54772 \]
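
Formula (1) translates directly into a short routine. The following Python sketch (not part of the original text) uses a hypothetical helper name, `standard_error_of_estimate`. Table 47 itself is not reproduced here, so the example uses the four pairs of Table 52 with the line Y = 1.8X + 3.8; these give the same totals, a sum of squared residuals of 1.20 with n = 4, which suggests (but is assumed rather than confirmed) that they are the same data.

```python
import math


def standard_error_of_estimate(X, Y, m, b):
    """Square root of the mean squared Y-residual about the line Y = mX + b."""
    residuals = [y - (m * x + b) for x, y in zip(X, Y)]
    return math.sqrt(sum(p * p for p in residuals) / len(residuals))


# The four (X, Y) pairs of Table 52 and their least-squares line.
X = [1, 3, 5, 7]
Y = [5, 10, 13, 16]
s_y = standard_error_of_estimate(X, Y, m=1.8, b=3.8)
print(round(s_y, 5))
```

The divisor here is n, following formula (1) of the text, not n - 2 as in some later treatments.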


EXERCISE 


Find  Sy  for  Exercises  1  and  2  on  page  229. 

It is evident that the Y-residuals and $S_Y$ are expressed in the given Y-unit. To interpret $S_Y$ intelligently requires a knowledge of the properties of a normal surface. It is sufficient at this point to state that for a distribution of sufficient size to approximate the normal form, about two-thirds of the points will lie in a strip bounded by two lines on either side of and parallel to the regression line, RR, and a vertical distance, $S_Y$, from it. That is, the odds are 2 to 1 that, for a given X, the observed Y will lie within the zone:

\[ (\text{computed } Y) \pm S_Y \]

Similarly, a zone established by drawing lines on either side of and parallel to RR and a vertical distance $2S_Y$ from it will include about 95 per cent of the points. That is, the odds are 95 to 5 or 19 to 1 that, for a given X, the observed Y will lie within the zone:

\[ (\text{computed } Y) \pm 2S_Y \]

If the zone is further enlarged, say $3S_Y$ vertically from RR above and below, it is practically certain (odds 385 to 1) that an observed Y will lie within the interval

\[ (\text{computed } Y) \pm 3S_Y \]

Let us illustrate these statements graphically. On Figure 39 we have plotted thirty points which represent graphically thirty (X, Y) sets of observed data. To these data we have fitted the regression line RR. It will be noted that twenty of the points lie within the zone determined by the parallels to RR and $\pm S_Y$ from it. Twenty-eight are within the area determined by the parallels to RR and $\pm 2S_Y$ from it. Only two of the points are outside the latter area.

Figure 39
Let us look at this matter somewhat differently. It is recalled that the line of regression may be used for purposes of estimating Y for given values of X. (When X is within the given abscissal range, we estimate Y from the equation; when X is outside the given abscissal range, we predict Y from the equation.) Thus, the computed value of Y may be an estimate or a prediction. In the language of probability, given an X, the equation gives the best or most probable value of Y.

When we use the regression equation to make estimates or predictions, we naturally are eager to know the degree of confidence to put in our results. Suppose we choose an X and compute Y. The odds are 2 to 1 that the observed Y will not differ numerically from the computed Y by more than $S_Y$. Thus, for Table 54, we have

\[ Y = -0.892X + 36.73 \]

billions of board feet, and $S_Y$ = 2.15 billions of board feet. Let X = 5. We find Y = 32.3 billions of board feet. The odds are 2 to 1 that this value does not differ from the observed Y (= 33.8) by more than $S_Y$ (= 2.15). That is, the odds are 2 to 1 that the observed Y is within the interval 32.3 ± 2.15 billions of board feet.

Of course if the student wishes to do so, he may use the probable error of estimate instead of the standard error as a measure of the reliability of his estimate. Since the probable error of any parameter is 0.6745 times the standard error of the parameter, we have

\[ \text{Probable error of estimate} = 0.6745\,(\text{Standard error of estimate}) = 0.6745\,S_Y \]

There is obviously a consequent change of language. In this case the chances are even that for a given X the observed Y will not differ from the estimated Y by more than $\pm 0.6745\,S_Y$.
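
The zones described above are simple to tabulate for a particular estimate. The following Python sketch (not part of the original text) takes the rounded constants quoted for Table 54 as given and works out the three zones and the probable error at X = 5; the variable names are illustrative.

```python
# Rounded constants quoted in the text for the lumber data of Table 54.
m, b = -0.892, 36.73      # trend line Y = mX + b
s_y = 2.15                # standard error of estimate (billions of board feet)

x = 5                     # the year 1920, since X = 0 at 1915
observed_y = 33.8
computed_y = m * x + b

# Zones (computed Y) +/- k * S_Y for k = 1, 2, 3.
zones = {k: (computed_y - k * s_y, computed_y + k * s_y) for k in (1, 2, 3)}
probable_error = 0.6745 * s_y

within_one = zones[1][0] <= observed_y <= zones[1][1]

print(round(computed_y, 1))
for k, (low, high) in zones.items():
    print(k, round(low, 2), round(high, 2))
print(round(probable_error, 2), within_one)
```

Here the observed value 33.8 falls inside the narrowest zone, and the probable error of estimate comes to about 1.45 billions of board feet.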

EXERCISES 

1. Show that $\Sigma p^2 = \Sigma[Y - (mX + b)]^2 = \Sigma Y^2 - b\Sigma Y - m\Sigma XY$.
Hint: Make use of the normal equations (4), page 217.

2. (a) Using Exercise 1 above, show that

\[ S_Y = \sqrt{\frac{\Sigma Y^2 - b\Sigma Y - m\Sigma XY}{n}} \tag{1'} \]

What sigma ($\Sigma$) function, not used in finding m and b, is needed to find $S_Y$ from formula (1')?
(b) Prove: $S_Y = \sigma_p$.
3. The following data are taken from The World Almanac, 1935, pp. 292 and 310.

X = Savings Bank Deposits in the U.S., 1918-1933.

Y = Number of Strikes and Lockouts in the U.S., 1918-1933.

   Year         Sav. Bk. Dep.        S. and L.O.
                (billions of $'s)    (thousands)
                X                    Y
   1918-1921    ...                  ...
   1922         7.1 (?)              1.1 (?)
   1923         7.7                  1.6
   1924         8.2                  1.2
   1925         8.9                  1.3
   1926         9.3                  1.0
   1927         9.5                  0.7
   1928         10.0                 0.6
   1929         10.1                 0.9
   1930         10.4                 0.7
   1931         11.0                 0.9
   1932         10.9                 0.8
   1933         10.4                 1.6

(1) Find the equation of the regression line.

(2) Interpret the value of m.

(3) Find $S_Y$ using formula (1'), and interpret it.

(4) If X = 8 find Y. Using $S_Y$ interpret your result.

(5) In 1935, X = 10.6. Compute Y and compare with the actual Y = 2.0.


4. In the following table

X = value of crops (dollars per acre)
Y = value of land and buildings (dollars per acre) in fifteen counties of Illinois in 1930.

   X       Y
   12.5    74
   19.8    170
   17.3    147
   9.9     57
   10.9    75
   7.5     46
   13.7    130
   13.1    89
   8.5     59
   3.8     20
   11.9    90
   8.6     74
   12.1    41
   11.9    77
   15.6    144

(1) Find the equation of the regression line.

(2) Interpret the value of m.

(3) Compute $S_Y$ by formula (1').

(4) Compute Y for X = 10, and interpret your result with the aid of $S_Y$.
5. In the following table

X = average yield (bushels per acre) of corn, 1910-1919.
Y = average land value (dollars per acre) on January 1, 1920 in twenty-five counties of Iowa.

   X     Y        X     Y
   40    87       41    193
   36    133      38    203
   34    174      38    279
   41    285      34    179
   39    263      45    244
   42    274      34    165
   40    235      40    257
   31    104      41    252
   36    141      42    280
   34    208      35    167
   30    115      33    168
   40    271      36    115
   37    163

(1) Find the equation of the regression line.

(2) Interpret the value of m.

(3) Compute $S_Y$ by formula (1').

(4) Compute Y when X = 40, and interpret your result.


63.   THE  BRAVAIS-PEARSON  COEFFICIENT 

OF  CORRELATION 

By far the major objection to $S_Y$ as a measure of the goodness of fit of a regression line to a swarm of points is this: it is a concrete number expressed in the given Y-unit. This fact renders it useless for purposes of comparison. What we really need is an index for measuring the closeness of fit that is independent of the unit of measure, a pure number, a relative which will measure the degree rather than the amount of the closeness with which the regression line estimates the observed values. We proceed to find such a measure.

To accomplish this end, it is very enlightening to express $S_Y$ in terms of the observed values. For the sake of simplicity, we shall assume that the observed data are referred to axes through $(M_X, M_Y)$. From equation (8) on page 222 we have:

\[ \Sigma p^2 = m^2 \Sigma x^2 - 2m \Sigma xy + \Sigma y^2 \]

where, in terms of the observed values,

\[ m = \frac{\Sigma xy}{\Sigma x^2} \]

If this value of m is substituted in the expression for $\Sigma p^2$, we obtain:

\[ \Sigma p^2 = \Sigma y^2 - \frac{(\Sigma xy)^2}{\Sigma x^2} \]

and

\[ S_Y^2 = \frac{\Sigma p^2}{n} = \frac{\Sigma y^2}{n}\left[ 1 - \frac{(\Sigma xy)^2}{\Sigma x^2\, \Sigma y^2} \right] \]

Recalling that

\[ \frac{\Sigma x^2}{n} = \sigma_X^2 \qquad \text{and} \qquad \frac{\Sigma y^2}{n} = \sigma_Y^2 \]

we have:

\[ S_Y^2 = \sigma_Y^2 \left[ 1 - \frac{(\Sigma xy)^2}{n^2 \sigma_X^2 \sigma_Y^2} \right] \]

and finally

\[ S_Y^2 = \sigma_Y^2 \left[ 1 - \left( \frac{\Sigma xy}{n \sigma_X \sigma_Y} \right)^2 \right] \]

or

\[ S_Y^2 = \sigma_Y^2 (1 - r_{XY}^2) \tag{2} \]

where

\[ r_{XY} = \frac{\Sigma xy}{n \sigma_X \sigma_Y} \tag{3} \]

is the well-known Bravais-Pearson coefficient of correlation.(1)

Since x and y are the deviations of the observed values from their respective means, it is evident that r can be very simply computed from the observed values.

The coefficient r plays such an important part in statistical analysis that it is advisable for us to show its relation to the slope, m, of the regression line. Thus:

\[ m = \frac{\Sigma xy}{\Sigma x^2} = \frac{n r \sigma_X \sigma_Y}{n \sigma_X^2} \]

or

\[ m = r\,\frac{\sigma_Y}{\sigma_X} \tag{4} \]

(1) As is our custom we shall omit the subscript XY, employing it only for purposes of identification.



The equations of the regression line of Y on X, (10) and (11) of Chapter 7 (p. 222), now become:

\[ y = r\,\frac{\sigma_Y}{\sigma_X}\, x \tag{5} \]

and

\[ Y - M_Y = r\,\frac{\sigma_Y}{\sigma_X}\,(X - M_X) \tag{6} \]

If $S_Y$ is taken to be the measure of goodness of fit of the line of regression of Y on X to the observed points, or a measure of the closeness of the relationship of X and Y, it will soon become evident that r is probably a superior measure for this relationship. From equation (2) it is evident that, since $S_Y^2$ and $\sigma_Y^2$ are positive, $1 - r^2$ cannot be negative, and therefore r must lie in the interval -1 to +1. That is:

\[ -1 \le r \le +1 \]

As r approaches unity numerically, $S_Y$ decreases toward zero, and this occurs when the points in general cluster closely about the line. As r approaches zero, $S_Y$ increases toward its maximum value, $\sigma_Y$, and this occurs when the points in general are widely dispersed about the line. If r equals unity numerically, $S_Y^2$ equals zero, hence each residual must equal zero, and the observed points lie upon the line. When r equals unity numerically, we have what is known as perfect correlation between the variables X and Y, for the lines of regression then describe the data perfectly.

Therefore a high coefficient of correlation means a small $S_Y$, and consequently a close relationship between Y and X, whereas a low coefficient of correlation means a large $S_Y$, and consequently a poor relationship between X and Y.

Thus we have found our index, for we see by (3) that r is a pure number (that is, it is independent of any units of measurement), and hence may be taken as a measure of the degree of the relationship between X and Y. It may therefore be used to measure the relationship between variates expressed in any units, as, bushels and dollars, inches and pounds, marks in English and marks in mathematics on different scales, and so on.

In Chapter 7 we learned that if the slope is positive, Y increases as X increases; if the slope is negative, Y decreases as X increases. From equation (4), since $\sigma_X$ and $\sigma_Y$ are always positive, m is positive if r is positive and m is negative if r is negative. Therefore, it follows that if r is positive, Y increases with X, and if r is negative, Y decreases as X increases. The converse of this statement is also true.
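
Relations (2), (3), and (4) can be verified on any set of paired data. The following Python sketch (not part of the original text) does so for the index numbers of Table 53; the variable names are illustrative only.

```python
import math

# Index numbers from Table 53: X = 1st year, Y = 2nd year.
X = [88, 77, 91, 75, 95, 83, 85, 82, 84, 80]
Y = [82, 71, 82, 70, 87, 77, 77, 77, 73, 74]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n
x = [a - mean_x for a in X]          # deviations from M_X
y = [b - mean_y for b in Y]          # deviations from M_Y

sum_xx = sum(p * p for p in x)
sum_yy = sum(q * q for q in y)
sum_xy = sum(p * q for p, q in zip(x, y))

sigma_x = math.sqrt(sum_xx / n)
sigma_y = math.sqrt(sum_yy / n)

r = sum_xy / (n * sigma_x * sigma_y)             # equation (3)
m = sum_xy / sum_xx                              # slope of y = mx

# Standard error of estimate two ways: from the residuals, and from (2).
s_y_direct = math.sqrt(sum((q - m * p) ** 2 for p, q in zip(x, y)) / n)
s_y_from_r = sigma_y * math.sqrt(1 - r * r)

print(round(r, 4), round(m, 4))
print(round(s_y_direct, 4), round(s_y_from_r, 4))
print(round(r * sigma_y / sigma_x, 4))           # equation (4): equals m
```

For these data r is about 0.95, a positive value, in agreement with the positive slope found for the same table.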

The remarks that we have just made about correlation have been from a mathematical standpoint. As we proceed in our study, however, these abstract notions will be clothed with real meaning. We are aware that certain characters tend to rise and fall together as if connected by some direct causal relation: for example, tall men in general weigh more than short men, young husbands in general are married to young wives, a falling barometer usually signifies an approaching storm, an abnormally small crop in general results in a higher price for the product. In other words, we are aware of the existence of certain persistent relationships between pairs of variables.

The existence of this persistent relationship between paired variables is the important feature of correlation. The variables may in general fluctuate directly or inversely, that is, high values of one variable will in general be paired with high values of the other variable, or high values of one variable will in general be paired with low values of the other; in either case they are said to be correlated.

Therefore:

Correlation may be defined as tendency toward concomitant variation, and a so-called coefficient is simply a measure of such tendency, more or less adequate according to the circumstances of the case.(1)

In the few preceding pages we have suggested three expressions for this relationship, namely, (1) the equation of the line of regression, (2) the value of the standard error of estimate, and (3) the coefficient of correlation. Each expression has its use, and we shall neglect none of them, but by far the greatest emphasis will be given the coefficient of correlation.*

(1) William Brown and G. H. Thomson, Essentials of Mental Measurement, 3d ed., 1921, p. 97.

* If it is desired, Section 66 (p. 247) may now be read to advantage.
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64.   COMPUTATION  OF  r  FOR  UNGROUPED  DATA 

Since  r  plays  such  an  important  r61e  in  the  study  of  relationships, 
we  shall  devote  considerable  attention  to  its  computation  and  to  its 
interpretation. 

The following should be the tabular arrangement for computing r
for ungrouped data when the computation is based upon formula (3).

    X         Y         x = X - Mx    y = Y - My    xy         x²         y²
    ...       ...       ...           ...           ...        ...        ...
    ΣX =      ΣY =                                  Σxy =      Σx² =      Σy² =
    Mx =      My =

The  following  steps  should  be  followed  in  the  arithmetical  summary: 

1. Find $\Sigma X$, then $M_x$.

2. Find $\Sigma Y$, then $M_y$.

3. Find $\Sigma x^2$, $\Sigma xy$, and $\Sigma y^2$.

4. Find $\sigma_x$, $\sigma_y$, and r.

5. Find m from equation (4), or from $m = \Sigma xy / \Sigma x^2$.

6. Write the regression equation of Y on X using equation (6).

7. Obtain the computed values of Y if they are desired.

8. Find $S_y$ from equation (2).

Table 55 below will illustrate the steps recommended in the
preceding summary.

We have:

$$n = 15 \qquad \Sigma X = 1402.8 \qquad \Sigma Y = 876.4$$
$$\Sigma xy = -1447.72 \qquad \Sigma x^2 = 2509.21 \qquad \Sigma y^2 = 1852.48$$
$$M_x = 93.5 \text{ bu.} \qquad M_y = 58.4\text{¢} \qquad \sigma_x = 12.93 \text{ bu.} \qquad \sigma_y = 11.11\text{¢}$$
$$r = -0.672 \qquad S_y = 8.23\text{¢}$$
$$m = \frac{-0.672(11.11)}{12.93} = -0.58$$

For the line of regression of Y on X we have:

$$Y - 58.4 = -0.58(X - 93.5)$$

or

$$Y = -0.58X + 112.63$$
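The eight steps of the arithmetical summary can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original text: the function name `correlation_summary` and the dictionary keys are our own choices, and the data are the yields and prices of Table 55. Because the sketch carries the unrounded means and slope through the computation, its intercept comes out slightly below the 112.63 obtained above from the rounded values.

```python
import math

# Table 55: potato yield X (bushels per acre) and farm price Y (cents per bushel), 1900-1914.
X = [80.8, 65.5, 96.0, 84.7, 110.4, 87.0, 102.2, 95.4,
     85.7, 106.1, 93.8, 80.9, 113.4, 90.4, 110.5]
Y = [43.1, 76.7, 47.1, 61.4, 45.3, 61.7, 51.1, 61.8,
     70.6, 54.1, 55.7, 79.9, 50.5, 68.7, 48.7]


def correlation_summary(X, Y):
    """Follow the steps of the arithmetical summary for ungrouped data."""
    n = len(X)
    Mx = sum(X) / n                                # step 1: sum of X, then the mean
    My = sum(Y) / n                                # step 2: sum of Y, then the mean
    x = [Xi - Mx for Xi in X]                      # deviations from the means
    y = [Yi - My for Yi in Y]
    Sxx = sum(xi * xi for xi in x)                 # step 3: sums of squares and products
    Sxy = sum(xi * yi for xi, yi in zip(x, y))
    Syy = sum(yi * yi for yi in y)
    sigma_x = math.sqrt(Sxx / n)                   # step 4: standard deviations and r
    sigma_y = math.sqrt(Syy / n)
    r = Sxy / (n * sigma_x * sigma_y)              # formula (3)
    m = r * sigma_y / sigma_x                      # step 5: equation (4)
    b = My - m * Mx                                # step 6: intercept of Y = mX + b
    S_y = sigma_y * math.sqrt(1 - r * r)           # step 8: equation (2)
    return {"n": n, "Mx": Mx, "My": My, "sigma_x": sigma_x, "sigma_y": sigma_y,
            "r": r, "m": m, "b": b, "S_y": S_y}


result = correlation_summary(X, Y)
for name, value in result.items():
    print(f"{name:8s} {value:10.3f}")
```

Step 7 (the computed values of Y) would then be `[result["m"] * Xi + result["b"] for Xi in X]`.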
Table 55. The Average Yield per Acre and the Average Farm Price
per Bushel for Potatoes in the United States, 1900-1914 ¹

Year     Yield (bushels)   Price (cents)       x        y          xy          x²          y²
               X                 Y
1900          80.8              43.1         -12.7    -15.3      194.31      161.29      234.09
1901          65.5              76.7         -28.0     18.3     -512.40      784.00      334.89
1902          96.0              47.1           2.5    -11.3      -28.25        6.25      127.69
1903          84.7              61.4          -8.8      3.0      -26.40       77.44        9.00
1904         110.4              45.3          16.9    -13.1     -221.39      285.61      171.61
1905          87.0              61.7          -6.5      3.3      -21.45       42.25       10.89
1906         102.2              51.1           8.7     -7.3      -63.51       75.69       53.29
1907          95.4              61.8           1.9      3.4        6.46        3.61       11.56
1908          85.7              70.6          -7.8     12.2      -95.16       60.84      148.84
1909         106.1              54.1          12.6     -4.3      -54.18      158.76       18.49
1910          93.8              55.7            .3     -2.7        -.81         .09        7.29
1911          80.9              79.9         -12.6     21.5     -270.90      158.76      462.25
1912         113.4              50.5          19.9     -7.9     -157.21      396.01       62.41
1913          90.4              68.7          -3.1     10.3      -31.93        9.61      106.09
1914         110.5              48.7          17.0     -9.7     -164.90      289.00       94.09

Total      1,402.8             876.4            .3       .4   -1,447.72    2,509.21    1,852.48
Mean          93.5+             58.4+

We have here a fairly significant coefficient of correlation,
r = -0.672. Its large numerical value warrants our belief that there
does exist a significant relationship between the average yield of
potatoes and the corresponding price per bushel. The negative sign,
as previously stated, means that as X increases Y decreases. In
accordance with our definition of slope, the value of -0.58 for m
means that on the average, an increase of one bushel per acre in the
yield will mean a diminished price of more than half a cent per
bushel.

Now, let us use our equation for estimating the price that corresponds
to a given yield, and $S_y$ for measuring the reliability of the
estimate. Let X = 100 bu. per acre; then Y estimated is -0.58(100)
+ 112.63 = 54.6 cents. Since $S_y$ = 8.23 cents, the odds are 2 to 1
that the observed Y for X = 100 does not differ from 54.6 cents by
more than 8.23 cents.
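The estimate and its band of one standard error can be checked with a short Python sketch. The function name `estimate_price` is hypothetical; the constants are the rounded m, intercept, and $S_y$ found above. The last lines rest on an assumption not stated in the text: if the residuals are normally distributed, the chance of falling within one standard error is about 0.68, which is close to the "2 to 1" odds quoted.

```python
import math

def estimate_price(yield_bu, m=-0.58, b=112.63, S_y=8.23):
    """Estimated price (cents) for a given yield, with the interval Y_est +/- S_y."""
    y_est = m * yield_bu + b
    return y_est, (y_est - S_y, y_est + S_y)

y_est, (low, high) = estimate_price(100)
print(f"estimated price: {y_est:.2f} cents, interval {low:.2f} to {high:.2f}")

# Assumption: normally distributed residuals. Probability of lying within one
# standard error of the estimate, and the corresponding odds in favor.
p_within = math.erf(1 / math.sqrt(2))
odds = p_within / (1 - p_within)
print(f"P(within one S_y) = {p_within:.4f}, odds about {odds:.2f} to 1")
```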

1  The  data  are  taken  from  Yearbook  of  Agriculture,  1920,  p.  616. 
EXERCISES

1. The average daily grades and the final examination grades for ten
students in a class in calculus are given in the table below.

X = the average daily grade

Y = the grade on the final examination

Find r, the line of regression of Y on X, and $S_y$. If X = 90, Y = ( ).

Student     X     Y        Student     X     Y
   1       86    71           6       96    94
   2       93    76           7       80    71
   3       73    61           8       70    60
   4       66    62           9       95    85
   5       88    75          10       63    55

2. The following table ¹ gives the results of experiments performed at
Delhi, California, to determine the effect of irrigation upon the yield in
alfalfa.

Find r if

X = the total seasonal depth (in inches) of water applied and

Y = the average yield (tons per acre) for the years 1922, 1923, 1924

  X      Y          X      Y
 12    5.27        36    8.20
 18    5.68        42    8.71
 24    6.25        48    8.42
 30    7.21        60    8.24

3. In the following table ²

X = the July rainfall (in inches) for the given year for Ohio, and
Y = yield of corn (bushels per acre)

Find r, $S_y$, and the regression equation. If X = 4, Y = ( ). Interpret.

Year     X      Y         Year     X      Y
1900    4.6    42.6       1905    3.9    37.9
1901    2.7    30.0       1906    5.1    42.2
1902    4.7    38.8       1907    5.4    34.8
1903    3.7    31.5       1908    4.1    36.1
1904    4.1    32.8       1909    3.8    38.7

¹ The data are from University of California Experiment Station, Bulletin
No. 460, p. 8.

² The data are taken from Monthly Weather Review, Vol. 42 (1914), p. 80.

65.    OTHER  FORMS  OF  r 

The correlation coefficient, r, as we have defined it by equation (3)
of Section 63 is expressed in terms of the deviations of the variates
from their respective means, $M_x$ and $M_y$. Since $M_x$ and $M_y$ usually
require several decimals for their results, we shall follow the plan
that we have used previously in Sections 34 (p. 125) and 44 (p. 164)
in computing $\sigma$, $\alpha_3$, and $\alpha_4$. The labor of computation can be greatly
reduced by expressing r in terms of the original variates X and Y,
or in terms of x' and y', where x' and y' are deviations in class units
from some fixed origin (h, k).

In Chapter 4 we have seen that:

$$\sigma_x = \sqrt{\frac{\Sigma X^2}{n} - \left(\frac{\Sigma X}{n}\right)^2} \quad\text{or}\quad n\sigma_x = \sqrt{n\Sigma X^2 - (\Sigma X)^2}$$

Similarly:

$$\sigma_y = \sqrt{\frac{\Sigma Y^2}{n} - \left(\frac{\Sigma Y}{n}\right)^2} \quad\text{or}\quad n\sigma_y = \sqrt{n\Sigma Y^2 - (\Sigma Y)^2}$$

Also, since

$$x = X - M_x \quad\text{and}\quad y = Y - M_y$$

we have:

$$xy = XY - M_yX - M_xY + M_xM_y$$

and

$$\Sigma xy = \Sigma XY - M_y\Sigma X - M_x\Sigma Y + nM_xM_y$$

Recalling that

$$\Sigma X = nM_x \quad\text{and}\quad \Sigma Y = nM_y$$

we have:

$$\Sigma xy = \Sigma XY - nM_xM_y$$

The formula for r can now be expressed in the useful form:

$$r = \frac{\Sigma XY - nM_xM_y}{n\sigma_x\sigma_y} \qquad (7)$$

Formula (7) may also be written

$$r = \frac{n\Sigma XY - \Sigma X\,\Sigma Y}{\sqrt{n\Sigma X^2 - (\Sigma X)^2}\,\sqrt{n\Sigma Y^2 - (\Sigma Y)^2}} \qquad (7')$$

We shall next derive a formula for r in which the X and Y variates
will be expressed in their respective class widths as units and measured
from some arbitrary origin (h, k).

Let: C be the centroidal point $(M_x, M_y)$
O' be the arbitrary origin (h, k)
$w_x$ = the class width of the X variates
$w_y$ = the class width of the Y variates

$$M_x = h + w_xb_x \quad\text{where}\quad b_x = \frac{\Sigma x'}{n} \quad\text{or}\quad nb_x = \Sigma x'$$

$$\sigma_x = w_x\sqrt{\frac{\Sigma x'^2}{n} - b_x^2}$$

Similarly, we can find:

$$M_y = k + w_yb_y \quad\text{where}\quad b_y = \frac{\Sigma y'}{n} \quad\text{or}\quad nb_y = \Sigma y'$$

$$\sigma_y = w_y\sqrt{\frac{\Sigma y'^2}{n} - b_y^2}$$

[Figure 40. A point P referred to the original origin O (given units), to the arbitrary origin O' = (h, k) (class units), and to the centroidal point C = $(M_x, M_y)$.]

From Figure 40 we have the following relations:

a. $X = x + M_x$, $\quad Y = y + M_y$

b. $X = h + w_xx'$, $\quad Y = k + w_yy'$

c. $x = w_xx' - w_xb_x$, $\quad y = w_yy' - w_yb_y$

Applying the relations c. above, we have:

$$xy = w_xw_y(x'y' - b_yx' - b_xy' + b_xb_y)$$

and hence

$$\Sigma xy = w_xw_y(\Sigma x'y' - b_y\Sigma x' - b_x\Sigma y' + nb_xb_y)$$

Substituting $\Sigma x' = nb_x$ and $\Sigma y' = nb_y$, we have:

$$\Sigma xy = w_xw_y(\Sigma x'y' - nb_xb_y)$$

Replacing in equation (3) $\Sigma xy$ by the value just found and $\sigma_x$ and
$\sigma_y$ by their values in terms of the primed letters, we have:

$$r = \frac{\dfrac{\Sigma x'y'}{n} - b_xb_y}{\sqrt{\dfrac{\Sigma x'^2}{n} - b_x^2}\,\sqrt{\dfrac{\Sigma y'^2}{n} - b_y^2}} \qquad (8)$$

the class widths canceling in the process.
By simple transformations equation (8) reduces to:

$$r = \frac{n\Sigma x'y' - \Sigma x'\,\Sigma y'}{\sqrt{n\Sigma x'^2 - (\Sigma x')^2}\,\sqrt{n\Sigma y'^2 - (\Sigma y')^2}} \qquad (9)$$

The following example will illustrate the method of procedure for
computing r by either formula, (8) or (9).

X = the grade on the first test
Y = the grade on the second test
Table 56. Grades of Two Tests of 10 Students
in Integral Calculus

Student     X     Y      x'     y'     x'y'     x'²     y'²
   1       85    77     10      7       70     100      49
   2       82    77      7      7       49      49      49
   3       91    82     16     12      192     256     144
   4       80    74      5      4       20      25      16
   5       75    70      0      0        0       0       0
   6       95    87     20     17      340     400     289
   7       83    77      8      7       56      64      49
   8       85    77     10      7       70     100      49
   9       88    82     13     12      156     169     144
  10       77    71      2      1        2       4       1

Total                   91     74      955   1,167     790

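Let h = 75, k = 70, $w_x$ = 1, and $w_y$ = 1.
We have from the table:

$$\Sigma x' = 91 \quad \Sigma y' = 74 \quad \Sigma x'y' = 955 \quad \Sigma x'^2 = 1167 \quad \Sigma y'^2 = 790 \quad\text{and}\quad n = 10$$

Hence:

$$b_x = 9.1 \qquad b_y = 7.4 \qquad \frac{\Sigma x'y'}{n} = 95.5 \qquad \frac{\Sigma x'^2}{n} = 116.7 \qquad \frac{\Sigma y'^2}{n} = 79$$

Therefore by (8):

$$r = \frac{95.5 - (9.1)(7.4)}{\sqrt{116.7 - 82.81}\,\sqrt{79 - 54.76}} = \frac{28.16}{\sqrt{33.89}\,\sqrt{24.24}} \approx 0.98$$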
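The fact that r does not depend on the origin chosen can be illustrated with a brief Python sketch using the grades of Table 56. The function name `r_from_origin` is our own; it applies formula (9) to deviations measured from any origin (h, k) in class widths $w_x$ and $w_y$. With h = k = 0 and unit class widths it reduces to the form (7') in the original variates.

```python
import math

# Table 56: grades of 10 students on the first test (X) and the second test (Y).
X = [85, 82, 91, 80, 75, 95, 83, 85, 88, 77]
Y = [77, 77, 82, 74, 70, 87, 77, 77, 82, 71]


def r_from_origin(X, Y, h, k, w_x=1, w_y=1):
    """Formula (9): r from deviations x', y' measured from (h, k) in class units."""
    n = len(X)
    xp = [(Xi - h) / w_x for Xi in X]              # x' = (X - h) / w_x
    yp = [(Yi - k) / w_y for Yi in Y]              # y' = (Y - k) / w_y
    s_x, s_y = sum(xp), sum(yp)
    s_xy = sum(a * b for a, b in zip(xp, yp))
    s_xx = sum(a * a for a in xp)
    s_yy = sum(b * b for b in yp)
    numerator = n * s_xy - s_x * s_y
    denominator = math.sqrt(n * s_xx - s_x ** 2) * math.sqrt(n * s_yy - s_y ** 2)
    return numerator / denominator


r_coded = r_from_origin(X, Y, h=75, k=70)          # origin used in Table 56
r_raw = r_from_origin(X, Y, h=0, k=0)              # original variates, form (7')
print(round(r_coded, 4), round(r_raw, 4))
```

Choosing an origin near the data keeps the sums small, which is the whole point of the device when the work is done by hand; the value of r is the same either way.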

66.  SUMMARY  AND  EXTENSION  OF  THE  THEORY 

OF  CORRELATION 

In Chapters 7 and 8 we have assumed that our data could be
represented by the straight-line equation, $Y = m_1X + b_1$, in which
X is the independent variable and Y the dependent variable. By
minimizing the sum of the squares of the Y-residuals, we derived the
normal equations:

$$m_1\Sigma X + nb_1 = \Sigma Y$$
$$m_1\Sigma X^2 + b_1\Sigma X = \Sigma XY$$

Solving these normal equations for $m_1$ and $b_1$, we obtained

$$m_1 = \frac{n\Sigma XY - \Sigma X\,\Sigma Y}{n\Sigma X^2 - (\Sigma X)^2} \qquad b_1 = \frac{\Sigma X^2\,\Sigma Y - \Sigma X\,\Sigma XY}{n\Sigma X^2 - (\Sigma X)^2} \qquad \text{(5) of Section 59}$$

and hence the equation of the line of regression of Y on X may be
found. This equation, for assigned values of X, gives the most
probable values for Y. This line (see Exercise 4 on p. 221) passes
through the point $(M_x, M_y)$ and (see Exercise 5 on p. 221) also
possesses the property that the sum of the Y-residuals from it is zero.

If the square root of the mean of the squares of the Y-residuals be
taken as a measure of the closeness of the concentration of the points
about the line, we find:

$$S_y = \sigma_y\sqrt{1 - r^2} \qquad \text{(2) of Section 63}$$

where

$$r = \frac{\Sigma xy}{n\sigma_x\sigma_y} \qquad \text{(3) of Section 63}$$

Then:

$$m_1 = r\frac{\sigma_y}{\sigma_x} \qquad \text{(4) of Section 63}$$

and the equation of the line of regression of Y on X becomes:

$$y = r\frac{\sigma_y}{\sigma_x}\,x \quad\text{or}\quad Y - M_y = r\frac{\sigma_y}{\sigma_x}(X - M_x) \qquad \text{(6) of Section 63}$$


In like manner we may arrive at similar results by basing our
procedure upon the equation $X = m_2Y + b_2$, where Y is now the
independent variable and X is the dependent variable.¹ If we
minimize the sum of the squares of the X-residuals we arrive at the
normal equations:

$$m_2\Sigma Y + nb_2 = \Sigma X$$
$$m_2\Sigma Y^2 + b_2\Sigma Y = \Sigma XY$$

If these equations be solved for $m_2$ and $b_2$, we obtain:

$$m_2 = \frac{n\Sigma XY - \Sigma X\,\Sigma Y}{n\Sigma Y^2 - (\Sigma Y)^2} \qquad b_2 = \frac{\Sigma Y^2\,\Sigma X - \Sigma Y\,\Sigma XY}{n\Sigma Y^2 - (\Sigma Y)^2} \qquad (10)$$

and hence the equation of the line of regression of X on Y may be
found. This equation, for assigned values of Y, gives the most
probable values of X. This line also passes through the point
$(M_x, M_y)$ and possesses the property that the sum of the X-residuals
from it is zero.

If the square root of the mean of the squares of the X-residuals be
taken as a measure of the closeness of the concentration of the
points about this line of regression of X on Y, we find:

$$S_x = \sigma_x\sqrt{1 - r^2} \qquad (11)$$

where, as before,

$$r = \frac{\Sigma xy}{n\sigma_x\sigma_y}$$

The value of $m_2$ may now be written

$$m_2 = r\frac{\sigma_x}{\sigma_y} \qquad (12)$$

and the equation of the line of regression of X on Y may be written:

$$x = r\frac{\sigma_x}{\sigma_y}\,y$$

or

$$X - M_x = r\frac{\sigma_x}{\sigma_y}(Y - M_y) \qquad (13)$$

¹ Note that $m_2$ is not the slope of this line.

We can therefore obtain two straight lines which fit the given n
points according to the principle of least squares. We can minimize
the sum of the squares of the Y-residuals of the line $Y = m_1X + b_1$
and obtain the regression line of Y on X given by equation (6).
This line is to be used to find the most probable Y for a given X.
We can minimize the sum of the squares of the X-residuals of the
line $X = m_2Y + b_2$ and obtain the regression line of X on Y given
by equation (13). It is to be used to find the most probable X for
a given Y.

Question: For what values of r will the lines (6) and (13) coincide?

The quantities $m_1$ and $m_2$ are called coefficients of regression. It
may be noted that:

$$r^2 = m_1m_2 \qquad (14)$$

If the deviations of the X and Y variates from their respective
means be expressed in units of their standard deviations, that is, if

$$t = \frac{x}{\sigma_x} = \frac{X - M_x}{\sigma_x} \quad\text{and}\quad s = \frac{y}{\sigma_y} = \frac{Y - M_y}{\sigma_y}$$

the equation (6) becomes:

$$s = rt \qquad (15)$$

That is, r is the slope of the line of regression of Y on X when the
variates x and y are expressed in standard units.
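Relations (14) and (15) can be verified numerically. The following Python sketch is an illustration only; it uses the potato data of Table 55, and the variable names are our own. It computes both coefficients of regression, checks that their product equals $r^2$, and checks that the least-squares slope of s on t equals r.

```python
import math

# Table 55: potato yield X (bushels per acre) and price Y (cents per bushel).
X = [80.8, 65.5, 96.0, 84.7, 110.4, 87.0, 102.2, 95.4,
     85.7, 106.1, 93.8, 80.9, 113.4, 90.4, 110.5]
Y = [43.1, 76.7, 47.1, 61.4, 45.3, 61.7, 51.1, 61.8,
     70.6, 54.1, 55.7, 79.9, 50.5, 68.7, 48.7]

n = len(X)
Mx, My = sum(X) / n, sum(Y) / n
x = [Xi - Mx for Xi in X]
y = [Yi - My for Yi in Y]
Sxx = sum(a * a for a in x)
Syy = sum(b * b for b in y)
Sxy = sum(a * b for a, b in zip(x, y))

m1 = Sxy / Sxx                        # coefficient of regression of Y on X
m2 = Sxy / Syy                        # coefficient of regression of X on Y
r = Sxy / math.sqrt(Sxx * Syy)        # same as Sxy / (n * sigma_x * sigma_y)

# Standard units: t = x / sigma_x, s = y / sigma_y.
sigma_x, sigma_y = math.sqrt(Sxx / n), math.sqrt(Syy / n)
t = [a / sigma_x for a in x]
s = [b / sigma_y for b in y]
slope_std = sum(a * b for a, b in zip(t, s)) / sum(a * a for a in t)

print(f"m1 = {m1:.4f}, m2 = {m2:.4f}, m1*m2 = {m1 * m2:.4f}, r^2 = {r * r:.4f}")
print(f"slope of s on t = {slope_std:.4f}, r = {r:.4f}")
```

Since $m_1$ and $m_2$ always share the sign of $\Sigma xy$, r is taken with that same sign when it is recovered as the square root of their product.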


EXERCISES 

1.  Using  the  data  in  Table  12  (p.  47),  find  the  correlation  between  the 
quantity  of  beef  available  for  consumption  and  the  price  per  hundred- 
weight.  Let  X  equal  the  quantity  available  and  Y  equal  the  price. 

2.  In  the  following  table  the  cows  considered  were  of  the  same  breed 
under  the  same  management.  Find  r. 


Value of Food Consumed by 26 Cows and Value of Products per Cow *

Value of Feed    Value of Product      Value of Feed    Value of Product
  Consumed           per Cow             Consumed           per Cow
     X                  Y                   X                  Y
  $99.83            $246.10              $98.93            $174.64
   86.42             207.76               82.69             143.61
   91.05             216.52               82.94             143.18
   94.05             220.01               87.03             150.02
   94.06             214.87               89.07             153.51
   86.06             183.53               83.52             143.61
   84.20             176.39               83.10             140.46
   86.70             178.56               89.16             150.68
   86.75             178.11               83.01             136.60
   86.57             166.70               89.32             145.41
   88.52             169.20               82.22             131.35
   94.01             179.25               99.74             157.28
   86.23             157.20               84.77             122.22

* The data are from Horace Secrist, Readings and Problems in Statistical
Methods, 1920, p. 420.
3. In the following table:
X = production in millions of bales

Y = price per pound in cents received by producers December 1

Production and Price of Cotton in the United States,¹ 1907-1929

Year     X      Y        Year     X      Y        Year     X      Y
1907   11.1   10.4       1920   13.4   13.9       1917   11.3   27.7
1908   13.2    8.7       1921    8.0   16.2       1918   12.0   27.6
1909   10.0   13.9       1922    9.8   23.8       1919   11.4   35.6
1910   11.6   14.1       1923   10.1   31.0
1911   15.7    8.8       1924   13.6   22.6
1912   13.7   11.9       1925   16.1   18.2
1913   14.2   12.2       1926   18.0   10.9
1914   16.1    6.8       1927   13.0   19.6
1915   11.2   11.3       1928   14.3   18.0
1916   11.5   19.6       1929   14.5   16.4

a. Find r for the ten-year period, 1907 to 1916 inclusive.

b. Find r for the ten-year period, 1920 to 1929 inclusive. The years
1917, 1918, and 1919 were abnormal years, and may be omitted from the
computation.

4. Using the relation

$$(x - y)^2 = x^2 - 2xy + y^2$$

show that:

$$r = \frac{\sigma_X^2 + \sigma_Y^2 - \sigma_{X-Y}^2}{2\sigma_X\sigma_Y}$$

In order to compute the value of r by this formula, what are the
implied restrictions upon the X and Y units?

5. Verify the value of r for the data of Table 56 by using the formula
of Exercise 4.

6. Find the value of r for the "Savings Bank, Strikes and Lockouts"
data of Exercise 3, page 236. Is the formula of Exercise 4 applicable to
these data?

7. Show that

$$r_{aX+b,\;cY+d} = r_{XY}$$

8.  Given 

-.2 

rxy=l---?~-l~--i» 

(T  y  (Ty 


^  The  data  are  from  Yearbook  of  Agriculture,  1928,  p.  837;  Commerce  Year- 
book, 1930,  p.  216. 
show that


(a)  r^y  =  1  ~ 


—  niZxy 


(b)  rxY  =  ± 


9. For a given X we estimate Y (call it $Y_{est.}$) by the equation (see
Section 63)

$$Y_{est.} = r\frac{\sigma_y}{\sigma_x}(X - M_x) + M_y$$

(a) Show that the arithmetic mean of the estimated values of Y is equal
to the arithmetic mean of the observed values of Y, or that

$$M_{Y_{est.}} = M_y$$

(b)  Show  that 


and  thus  that 
the  sign  to  be  that  of  m. 
10.  = 


Yttt. 


X 


2 
4 
6 
8 
10 
12 


6 
8 
10 
12 
14 
16 


X 


y 


xy 


(1)  Plot  the  data. 

(2)  Complete  the  table. 

(3)  Compute  a^. 

(4)  Compute  cy. 

(5)  Compute  r. 

(6)  Compute  m. 

(7)  Find  the  regression  line. 


11. Treat the data in the accompanying tables as you did those in
Number 10 above.

    (a)              (b)
  X     Y          X     Y
  1    10          1    10
  2     8          2     7
  3     6          3     4
  4     4          4     1
  5     2          5    -2


12. 


(a) 


(b) 


X 

F 

X 

F 

0 

12 

0 

7 

3 

5 

3 

4 

5 

3 

4 

3 

7 

0 

5 

0 

-  3 

5 

-  3 

4 

~  5 

3 

-  4 

3 

-  7 

0 

~  5 

0 

Treat  the  data  in  the  accompany- 
ing tables  as  you  did  those  in 
Number  10  above. 
13. The equations of the lines of regression for a set of data are y = 0.72x
and x = 0.64y. What is the value of r for the data?

14. The equations of the lines of regression for a set of data are
y = -0.8x and x = -0.45y. What is the value of r?

67.   COMPUTATION  OF  r  FOR  GROUPED  DATA 

For sufficiently small values of n, say n < 30, one of the methods
employed in the preceding sections is usually used in computing the
coefficient of correlation.

If n is a very large number, we are compelled to construct a double-entry
table. To construct such a table (see Table 57), the sheet is
ruled horizontally and vertically, thus dividing the sheet into a
system of columns and a system of rows, each of which is a frequency
distribution. Each of the rectangles in a row or column is called a
cell. Along the left-hand margin from bottom to top are laid off the
class intervals or the class marks of the Y variates, and along the
top of the diagram from left to right are laid off the class intervals
or the class marks of the X variates. Very much as we plot points
on an ordinary XY-coordinate system, each observed individual
may now be located on this sheet, the preliminary or tally sheet,
with respect to the X and Y measures. We shall locate each individual
on the preliminary sheet with a + sign placed within the
appropriate cell. Since we shall finally concentrate all the measures
in a given cell at its center, it is not necessary that the points be
plotted with more precision than is necessary to locate them in the
appropriate cells. When all the individuals are accurately located
we have a scatter diagram.


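Table 57

[Schematic double-entry table: class marks X1, X2, X3, ..., Xp laid off across the top, class marks Y1, Y2, ... laid off up the left-hand margin, and one rectangle of the grid labeled "cell".]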
A correlation table may now be obtained from the preliminary sheet
by writing within each cell the number of + marks which fall within
it. This number is called the cell frequency. We shall indicate a cell
frequency by f(x, y). The table is now used for a work-sheet or a
computation sheet.

The numbers in a column corresponding to an assigned X, say
X = X1, form a Y-array of the type X1, and those in a row corresponding
to an assigned Y, say Y = Y1, form an X-array of type Y1.
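The passage from tally sheet to correlation table can be sketched in Python. This is a minimal illustration under stated assumptions: the eight (X, Y) pairs are the first eight legible rows of the second half of Table 58 (Lackawanna through McKean), the class boundaries start at 69.95 and 0.95 with widths 3 and 1 as chosen later in this section, and the names `cell_of`, `table`, `f_x`, and `f_y` are our own.

```python
import math
from collections import Counter

# (percentage native white X, percentage illiterate Y) for eight counties of Table 58.
pairs = [(77.1, 8.6), (96.3, 1.4), (80.2, 6.5), (95.3, 2.5),
         (89.3, 2.3), (77.3, 9.5), (94.4, 1.5), (86.2, 1.8)]

X0, w_x = 69.95, 3.0      # lowest X class boundary and X class width
Y0, w_y = 0.95, 1.0       # lowest Y class boundary and Y class width


def cell_of(X, Y):
    """Return (column, row) of the cell containing the point, counting from 0."""
    col = math.floor((X - X0) / w_x)
    row = math.floor((Y - Y0) / w_y)
    return col, row


# Each + mark on the tally sheet becomes one count in the cell frequency f(x, y).
table = Counter(cell_of(X, Y) for X, Y in pairs)

# Marginal totals: the row marked f(x) and the column marked f(y).
f_x, f_y = Counter(), Counter()
for (col, row), f in table.items():
    f_x[col] += f
    f_y[row] += f

for (col, row), f in sorted(table.items()):
    x_mark = X0 + (col + 0.5) * w_x           # class mark at the center of the cell
    y_mark = Y0 + (row + 0.5) * w_y
    print(f"X = {x_mark:.2f}, Y = {y_mark:.2f}: f = {f}")
```

Because the boundaries are carried to hundredths while the data are given to tenths, no point can fall exactly on a boundary, so every individual lands in exactly one cell.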

The  correlation  table  may  be  represented  geometrically  by  a  sur- 
face. At  the  center  of  each  cell  imagine  a  vertical  erected  with  a 
height  proportional  to  the  cell  frequency.  If  the  tops  of  these  verticals 
be  joined,  an  irregular  surface  results.  If  the  cells  are  made  smaller 
and  smaller  while  the  frequencies  remain  finite,  the  irregular  surface 
will  approach  a  regular  surface  which  is  called  a  frequency  surface 
or  a  correlation  surface. 

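Since in this chapter we are dealing with grouped data, it is advisable
that we write our formulas for r in the frequency forms.
Thus equations (3), (7'), (8), and (9) become:

$$r = \frac{\Sigma xy\,f(x, y)}{n\sigma_x\sigma_y} \qquad (16)$$

$$r = \frac{n\Sigma XY\,f(x, y) - \Sigma X f(x)\,\Sigma Y f(y)}{\sqrt{n\Sigma X^2 f(x) - [\Sigma X f(x)]^2}\,\sqrt{n\Sigma Y^2 f(y) - [\Sigma Y f(y)]^2}} \qquad (17)$$

$$r = \frac{\dfrac{\Sigma x'y'\,f(x, y)}{n} - b_xb_y}{\sqrt{\dfrac{\Sigma x'^2 f(x)}{n} - b_x^2}\,\sqrt{\dfrac{\Sigma y'^2 f(y)}{n} - b_y^2}} \qquad (18)$$

$$r = \frac{n\Sigma x'y'\,f(x, y) - \Sigma x' f(x)\,\Sigma y' f(y)}{\sqrt{n\Sigma x'^2 f(x) - [\Sigma x' f(x)]^2}\,\sqrt{n\Sigma y'^2 f(y) - [\Sigma y' f(y)]^2}} \qquad (19)$$

where $b_x = \Sigma x' f(x)/n$ and $b_y = \Sigma y' f(y)/n$.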


The  data  of  Table  58  will  be  used  as  an  example  to  illustrate  the 
construction  of  the  preliminary  sheet,  the  correlation  table,  and  the 
method  employed  in  computing  r,  the  regression  equations,  and 
the  standard  error  of  estimate. 

We shall let the percentage of native white population be measured
along the horizontal or X-axis, and the percentage of illiteracy be

Table 58. Percentage of Native White and Percentage of
Illiterate Ten Years of Age and Over in the Population
of Pennsylvania, by Counties, 1920 ¹


County              Percentage Native White (X)     Percentage Illiterate (Y)

1  A 

/  '±.0 

A.  ft 

Arnisiirong  . 

A  Q 

-oeaver  .... 

/o.y 

A  O 

r>eciiorci. . . . 

yo.y 

yo.4: 

O  7 

0  A 

jDrauiora. . . 

OA  o 

yo.J 

Q7  A 
o/  .0 

QO  A 

o.U 

v^amDna .  .  . 

70  O 

A  Q 

o.o 

\^aiiieron . .  . 

oy.y 

9  A 

L/arDon .... 

Q  Q 
O.O 

■  Via 

yo.o 

1  Q 
i.O 

L/iiester .... 

Ql  7 

4.0 

vyiarion .... 

yo.Tc 

1  ft 
1  .o 

vyiearneici  . . 

c5o.y 

A  A 

v^iinton .... 

ycS.u 

0  f; 

L/OiumDia 

OQ  o 

0  ft 
Z.o 

v^rawiorci  . . 

yz.o 

1.0 

Ill  1  nr\  r\ot*  1  q  riM 

Vy'tllllUUl  idiliU. 

uaupuin .  . . 

ft7  ft 
Of  .o 

O.O 

ueiaware. . . 

7f^  ft 
/  O.o 

4.4 

ftl  A 
oi.O 

Q  1 
O.i 

Erie                         84.8                           4.0
Fayette                      76.3                           8.2
Forest                       94.1                           2.7
Franklin                     97.0                           1.8
Fulton                       98.9                           2.3
Greene                       93.5                           4.4
Huntingdon                   93.1                           4.0
Indiana                      82.4                           5.9
Jefferson                    88.0                           3.5
Juniata                      99.1                           1.2

County              Percentage Native White (X)     Percentage Illiterate (Y)
Lackawanna                   77.1                           8.6
Lancaster                    96.3                           1.4
Lawrence                     80.2                           6.5
Lebanon                      95.3                           2.5
Lehigh                       89.3                           2.3
Luzerne                      77.3                           9.5
Lycoming                     94.4                           1.5
McKean                       86.2                           1.8
Mercer                       80.0                           6.2
Mifflin                      96.4                           2.4
Monroe                       95.0                           2.3
Montgomery                   83.4                           3.6
Montour                      94.2                           4.8
Northampton                  81.9                           5.2
Northumberland               88.9                           4.7
Perry                        98.9                           1.5
Philadelphia                 70.7                           4.0
Pike                         90.6                           1.3
Potter                       93.0                           1.9
Schuylkill                   84.0                           7.9
Snyder                       99.8                           2.1
Somerset                     84.3                           6.4
Sullivan                     90.4                           4.5
Susquehanna                  91.0                           2.8
Tioga                        93.4                           2.4
Union                        99.3                           1.4
Venango                      92.7                           3.5
Warren                       86.2                           3.3
Washington                   74.0                           7.3
Wayne                        90.9                           2.8
Westmoreland                 77.8                           7.6
Wyoming                      96.0                           1.6
York                         97.2                           1.6

measured along the vertical or Y-axis. The class widths, $w_x$ and $w_y$,
may be selected in accordance with the principles suggested in
Section 13 (p. 30). Since the X variates range from 70.7 to 99.8,
we shall choose $w_x$ = 3, and since the Y variates range from 1.2 to
9.5, we shall choose $w_y$ = 1. Also, since the given measures are
accurate to tenths, we shall express our class boundaries to hundredths.²
Plotting the points, we have the preliminary sheet.

[Preliminary Sheet. Percentage Native White (X) across the top, with class boundaries 69.95, 72.95, 75.95, ..., 99.95; Percentage Illiterate (Y) down the left-hand margin, with class boundaries 9.95, 8.95, ..., 0.95; each of the 67 counties is tallied by a + mark in its cell.]

¹ The data are from Fourteenth Census of the United States, Vol. III, pp. 859-65.

The  preliminary  sheet  is  now  complete.  We  are  now  ready  to 
transcribe  the  results  of  the  tally  to  the  computation  sheet.  We 
then  have  Table  59. 

Having formed the correlation table, which is the part of the
table bounded by the double lines, we arrange the computation to
simplify as far as possible the somewhat complicated details. We
first add the frequencies of the rows and columns and obtain the
row marked f(x) and the column marked f(y). Choosing an arbitrary
origin (h, k) near the center of the table (in Table 59, (h, k)
= (83.45, 4.45)) and the class intervals as units of measurement,
we obtain the row marked x' and the column marked y'. That is, we
use the familiar transformations $X = h + w_xx'$ and $Y = k + w_yy'$.
The next two rows, $x'f(x)$ and $x'^2f(x)$, and the next two columns,
$y'f(y)$ and $y'^2f(y)$, are self-explanatory and are used in computing
the means, $M_x$ and $M_y$, and the standard deviations, $\sigma_x$ and $\sigma_y$.

² If the student prefers he may use some other method for fixing the class
limits. Any method recommended in Section 12 (p. 23) will be satisfactory.
Thus the X-class intervals may be 70.0-72.9, 73.0-75.9, etc., and the Y-class
intervals may be 1.0-1.9, 2.0-2.9, etc. The class marks will be changed accordingly.
The X-class marks will become 71.45, 74.45, etc., and the Y-class marks will
become 1.45, 2.45, etc.

The column headed $x'y'f(x, y)$ needs some explanation. Recalling
formula (18) for computing r, we note that we must find $\Sigma x'y'f(x, y)$.
That is, we must find the x'y' for each individual measured, then find
their sum. Since the frequency of any cell is concentrated at the
center of the cell, we shall compute the x'y' for the frequency of each
cell by multiplying the x'y' of each cell by the cell frequency, and
adding the x'y' for all the cells of a given row. In this manner we
obtain the numbers in the column headed $x'y'f(x, y)$. By adding the
x'y' of all the rows, we obtain the sum of the x'y' of the entire table.*
Thus:

for row Y = 8.45, the total x'y' is (-2)(4)(2) + (0)(4)(1) = -16

The total x'y' for each of the other rows is found in a similar manner.
Consequently, for the entire distribution we have:

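$$n = 67, \quad h = 83.45, \quad k = 4.45, \quad w_x = 3, \quad w_y = 1$$
$$\Sigma x'f(x) = 116 \qquad \Sigma y'f(y) = -52 \qquad \Sigma x'^2f(x) = 612$$
$$\Sigma y'^2f(y) = 342 \qquad \Sigma x'y'f(x, y) = -362$$

Therefore:

$$b_x = \frac{116}{67}, \qquad M_x = 83.45 + 3\left(\frac{116}{67}\right) = 88.64\%$$

$$b_y = -\frac{52}{67}, \qquad M_y = 4.45 - \frac{52}{67} = 3.67\%$$

$$\sigma_x = 3\sqrt{\frac{612}{67} - \left(\frac{116}{67}\right)^2} = 3(2.48) = 7.44\%$$

$$\sigma_y = \sqrt{\frac{342}{67} - \left(\frac{-52}{67}\right)^2} = 2.12\%$$

Using equation (18) we have:

$$r = \frac{\dfrac{-362}{67} - \left(\dfrac{116}{67}\right)\left(\dfrac{-52}{67}\right)}{(2.48)(2.12)} = -0.77$$

* A row $x'y'f(x, y)$ is similarly found. It is useful for checking the column
$x'y'f(x, y)$.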

The equations of the lines of regression can now be found:

$$m_1 = r\frac{\sigma_y}{\sigma_x} = \frac{-0.77(2.12)}{3(2.48)} = -0.22$$

Using

$$Y - M_y = m_1(X - M_x)$$

we obtain the equation for the regression of Y on X with its $S_y$. It is:

$$Y - 3.67 = -0.22(X - 88.64) \quad\text{or}\quad Y = -0.22X + 23.17$$

$$S_y = 2.12\sqrt{1 - (.77)^2} = 1.35\%$$
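The whole grouped-data computation, from the five sums of the work-sheet to the regression of Y on X, can be reproduced with a short Python sketch. It is offered only as a check on the arithmetic: the variable names are our own, and because the sketch does not round intermediate results, its figures differ from the text in the last decimal place or so (for example $\sigma_x \approx 7.43$ against 7.44, and an intercept near 23.2 against 23.17).

```python
import math

# Summary figures taken from the computation sheet for Table 58.
n = 67
h, k = 83.45, 4.45            # arbitrary origin
w_x, w_y = 3, 1               # class widths
sum_x, sum_y = 116, -52       # sum of x'f(x) and sum of y'f(y)
sum_xx, sum_yy = 612, 342     # sum of x'^2 f(x) and sum of y'^2 f(y)
sum_xy = -362                 # sum of x'y' f(x, y)

b_x = sum_x / n
b_y = sum_y / n
M_x = h + w_x * b_x                               # mean of X
M_y = k + w_y * b_y                               # mean of Y
s_x = math.sqrt(sum_xx / n - b_x ** 2)            # standard deviations in class units
s_y = math.sqrt(sum_yy / n - b_y ** 2)
sigma_x = w_x * s_x                               # standard deviations in given units
sigma_y = w_y * s_y

r = (sum_xy / n - b_x * b_y) / (s_x * s_y)        # formula (18)
m1 = r * sigma_y / sigma_x                        # coefficient of regression of Y on X
b1 = M_y - m1 * M_x                               # intercept of Y = m1*X + b1
S_y = sigma_y * math.sqrt(1 - r * r)              # standard error of estimate

print(f"M_x = {M_x:.2f}, M_y = {M_y:.2f}")
print(f"sigma_x = {sigma_x:.2f}, sigma_y = {sigma_y:.2f}, r = {r:.3f}")
print(f"Y = {m1:.3f} X + {b1:.2f}, S_y = {S_y:.2f}")
```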

For a given value of X, this equation gives the best (that is, the
most probable) value for Y. This most probable value of Y is the
mean of the Y-array corresponding to a given value of X. Hence,
the equation above gives the expected mean ¹ of the Y-array for
a given X. We use $S_y$ to measure its reliability.

For example, if X = 86.45, we obtain Y = 4.15 for the estimated
or expected mean of the Y-array. We may compare this with the
observed mean for X = 86.45 by computing the mean of the distribution
in the usual manner. We find the observed mean for the
Y-array corresponding to X = 86.45 to be 3.28.

When X = 86.45 we found the estimated Y, $Y_{est.}$, to be 4.15.
Combining this value with its measure of reliability $S_y$ = 1.35 we
have this fact: the odds are 2 to 1 that the observed Y for X = 86.45
does not differ numerically from $Y_{est.}$ = 4.15 by more than 1.35.
In other words, the odds are 2 to 1 that for X = 86.45% the observed
Y will lie in the interval 4.15 ± 1.35%.

¹ It may be shown that the line of regression of Y on X is the line which best
fits the points which designate the means of the Y-arrays or columns, and that
the line of regression of X on Y best fits the points which designate the means of
the X-arrays or rows.

$$m_2 = r\frac{\sigma_x}{\sigma_y} = \frac{-0.77(7.44)}{2.12} = -2.70$$

Using

$$X - M_x = m_2(Y - M_y)$$

we obtain the equation for the regression of X on Y with its $S_x$. It is:

$$X - 88.64 = -2.70(Y - 3.67)$$

or

$$X = -2.70Y + 98.55$$

$$S_x = 7.44\sqrt{1 - (.77)^2} = 4.75\%$$

As a check, $m_1m_2 = (-0.22)(-2.70) = 0.59$, which agrees with $r^2 = (0.77)^2$ as required by equation (14).

For a given value of Y, this equation gives the most probable
value for X. That is, for a given Y, this equation gives the expected
mean of the corresponding X-array.

For example, if Y = 3.45, we obtain X = 89.2 for the estimated
mean of the array. The observed mean of the X-array corresponding
to Y = 3.45 is X = 87.95. We use $S_x$ to measure the reliability
of the estimate. Thus the odds are 2 to 1 that for Y = 3.45% the
observed X will lie within the interval 89.2 ± 4.75%.

This  completes  the  theory  of  simple  linear  correlation.  A  word 
about  the  reliability  of  r  may  be  in  order.  If  n  is  fairly  large  and 
if  the  surface  described  on  page  254  is  closely  normal,  the  reliability 
of  r  may  be  tested  by  either  of  the  formulas: 

$$\sigma_r = \frac{1 - r^2}{\sqrt{n}}$$

$$E_r = 0.6745\,\sigma_r = 0.6745\,\frac{1 - r^2}{\sqrt{n}}$$

with the interpretation of $\sigma_r$ and $E_r$ similar to that employed in
Section 37. Since the assumptions underlying these formulas are
rather severe, they are to be used with care.
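
As a rough illustration of the two formulas, the sketch below evaluates the standard error and the probable error of r. The value r = 0.77 is the magnitude used in the preceding example; the sample size n = 100 is a hypothetical figure chosen only for the illustration, and the function names are not from the text.

```python
import math


def standard_error_of_r(r, n):
    """sigma_r = (1 - r^2) / sqrt(n); intended for fairly large n."""
    return (1.0 - r * r) / math.sqrt(n)


def probable_error_of_r(r, n):
    """E_r = 0.6745 * sigma_r."""
    return 0.6745 * standard_error_of_r(r, n)


r = 0.77   # magnitude of r from the worked example
n = 100    # hypothetical sample size, assumed for illustration only

print(standard_error_of_r(r, n))
print(probable_error_of_r(r, n))
```

Both quantities shrink as n grows and as r approaches 1 in absolute value, which is why the same r deserves more confidence when it rests on more observations.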

EXERCISES 

1. The data for the table on page 261 are taken from the Yearbooks of
Agriculture: 1920, pp. 753 and 537; 1935, pp. 568 and 379.

X = price of corn per bushel (cents)
Y = value of hogs per head (dollars)

Construct a correlation table with the X-classes: 20 a.u. 35, 35 a.u. 50,
etc., and the Y-classes 3.00 a.u. 6.00, 6.00 a.u. 9.00, etc.

Find r, the equation of the regression line of Y on X, and $S_Y$.

Estimate Y when X = lb and give the odds that measure the reliability
of the estimate.


Year

Corn 
Cents 
per  bu. 
X 

Hogs 
Dollars 
per  head 
Y 

Year 

Corn 
Cents 
per  bu. 
X 

Hogs 
Dollars 
per  head 
Y 

4Q 

nPO.Ov/ 

41 

*t  X 

lOf  1 

to 

^  fit 
O.Ux 

IQOft 

40 

O.lo 

ou 

4  01 

1Q07 

7  fi9 

1873 

44 

3.67 

1908 

61 

6.05 

1874 

58 

3.98 

1909 

58 

6.55 

^7 

4  80 

1Q10 

*±o 

Q  17 

xO  1  u 

00 

IQl  1 

Q  ^7 

JLO  f  f 

o.uu 

1Q12 

40 

8  00 

1878 

32 

4.85 

1913 

69 

9.86 

1879 

38 

3.18 

1914 

64 

10.40 

40 

4  28 

1015 

0  87 

J.OO  i. 

64 

4  70 

1016 

X  i/  X  W 

80 

8  40 

4Q 

Q7 

1Q17 

X  i7  X  f 

128 

1 1  7^^ 

1  X  .  f  O 

1883 

42 

6.75 

1918 

137 

19.54 

1884 

36 

5.57 

1919 

135 

22.02 

5  02 

1920 

68 

10  08 

4  26 

1021 

X  i7<U  X 

12  00 

1887 

44 

4  48 

1922 

75 

10  06 

1888 

34 

4.98 

1923 

84 

11.58 

1889 

28 

5.79 

1924 

105 

9.72 

18Q0 

51 

tyX 

4  72 

1925 

70 

12  38 

18Q1 

XOt7  X 

41 

TC  X 

4  15 

1926 

75 

15  21 

X  C/  .M  X 

18Q2 

39 

4  60 

1927 

JL  %J  am  1 

85 

15  97 

1893 

37 

6.41 

1928 

84 

12.03 

1894 

4o 

5.9o 

oU 

lz.z4 

1895 

25 

4.97 

1830 

59 

12.73 

1896 

22 

4.35 

1931 

32 

10.75 

1897 

26 

4.10 

1932 

32 

5.80 

1898 

29 

4.39 

1933 

52 

3.99 

1899 

30 

4.40 

1934 

85 

3.92 

1900 

36 

5.00 

1901 

61 

6.20 

1902 

40 

7.03 

1903 

43 

7.78 

1904 

44 

6.15

2. The accompanying table shows the scores on placement examinations
of 326 freshmen at Bucknell University. Find r and the equations of the
lines of regression.

Examination  Scores  in  Mathematics  and  English 


Mathematics 


• 

12.5 

17.5 

22.5 

27.5 

32.5 

37.5 

42.5 

47.5 

52.5 

57.5 

62.5 

67.5 

72.5 

252.5 

1 

1 

1 

1 

237.5 

1 

3 

1 

222.5 

1 

1 

1 

1 

1 

3 

2 

2 

1 

207.5 

1 

2 

3 

3 

5 

2 

1 

1 

2 

192.5 

3 

4 

7 

4 

6 

2 

2 

1 

1 

177.5 

2 

5 

3 

2 

1 

6 

4 

1 

1 

1 

1 

162.5 

1 

1 

6 

6 

8 

18 

7 

6 

5 

3 

3 

147.5 

1 

3 

9 

3 

5 

5 

3 

2 

1 

2 

132.5 

2 

4 

3 

3 

8 

8 

3 

2 

1 

1 

117.5 

4 

5 

6 

5 

9 

4 

1 

3 

1 

102.5 

1 

2 

4 

3 

4 

3 

3 

1 

1 

87.5 

1 

1 

1 

3 

3 

2 

1 

1 

72.5 

2 

3 

3 

5 

2 

1 

57.5 

1 

1 

1 

1 

42.5 

1 

3.  In  the  following  table 

X  =  the  number  of  minutes  required  to  solve  a  group  of  arithmetical 

exercises  by  each  of  forty  employees 
Y  =  the  executive  ratings,  in  per  cent,  of  the  same  employees 


   X      Y        X      Y        X      Y        X      Y
 12.4    90      17.2    11      24.0    67      18.3    72
 14.0    85       8.8    94      12.4    91       8.5    96
 15.5    83      11.6    90      20.2    74      10.4    92
 25.0    70      20.6    68      16.2    82      17.6    80
 15.8    80       9.8    91      12.2    88      22.4    72
 23.4    74      11.2    89      16.0    82      15.3    78
 22.3    78       8.7    96      13.3    87      21.5    70
 13.5    88      25.8    65       9.2    92      13.2    87
 17.8    82      12.6    88      12.4    87      17.6    75
 14.4    92      16.5    74      26.3    60       9.5    94

Construct a double entry table with the X-classes designated as 8.0 a.u.
12.0, 12.0 a.u. 16.0, etc., the Y-classes designated as 60 a.u. 65, 65 a.u. 70, etc.
Find r, the regression line of Y on X, and $S_Y$.

What is the estimated value of Y for X = 20, and what is the reliability
of the estimate?

68.   CORRELATION  BY  RANKS

When  two  series  of  values  are  expressed  according  to  their  ranks 
and not in terms of their actual values or scores, we can easily find
the  approximate  correlation  between  them.  Such  correlation  is  used 
to  find  the  relation  between  the  paired  scores  when  their  number 
is  small  or  when  the  data  do  not  warrant  an  application  of  the  cross 
product  method  to  the  actual  values.  Also,  the  method  is  useful 
in  finding  the  correlation  between  series  that  may  be  arranged  ac- 
cording to  size  and  yet  may  not  be  subjected  to  exact  measurement. 

In  such  correlation  as  we  are  here  describing  we  must  keep  in 
mind  that  the  (X,  Y)  values  are  the  rank  or  position  numbers  of  some 
characteristics.  We  shall  arrange  the  values  in  ascending  order. 
To  the  smallest  value  we  assign  1,  to  the  next  in  order  2,  etc.  We 
may then find the rank correlation by employing any of our formulas
for $r_{XY}$ with the data arranged according to ranks. However, a
formula may be easily derived for this special case by a method which
we shall indicate at the end of this section. When ranks are used
we indicate the coefficient by $\rho_{XY}$ or by $r_{XY}$ (rank).

To illustrate the problem we are presenting, let us consider the
heights and weights of the five boys A, B, C, D, E.


Table 60. Heights and Weights of Five Boys

Boy    Height      Weight      Rank in Height    Rank in Weight
       (inches)    (pounds)          X                 Y
 A        60         137             1                 2
 B        62         132             2                 1
 C        63         148             3                 3
 D        65         157             4                 5
 E        68         153             5                 4

For the height-weight data given in columns 2 and 3,
$r_{\text{height, weight}} = 0.77$.

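Let us find the cross product coefficient for the rank data given in
columns 4 and 5 of Table 60.

           X     Y     x     y    x²    y²    xy    X²    Y²    XY
           1     2    -2    -1     4     1     2     1     4     2
           2     1    -1    -2     1     4     2     4     1     2
           3     3     0     0     0     0     0     9     9     9
           4     5     1     2     1     4     2    16    25    20
           5     4     2     1     4     1     2    25    16    20
Totals    15    15     0     0    10    10     8    55    55    53

$$M_X = M_Y = \frac{15}{5} = 3 \qquad \sigma_X = \sigma_Y = \sqrt{\frac{10}{5}} = \sqrt{2}$$

$$\rho_{XY} = r_{XY}\ (\text{rank}) = \frac{\Sigma xy}{n\,\sigma_X \sigma_Y} = \frac{8}{5\sqrt{2}\,\sqrt{2}} = 0.80$$

We may also find $\rho_{XY} = r_{XY}$ (rank) by using formula (7) page 245,

$$r = \frac{\Sigma XY - n\,M_X M_Y}{n\,\sigma_X \sigma_Y}$$

We have

$$\sigma_X = \sigma_Y = \sqrt{\frac{\Sigma Y^2}{n} - (M_Y)^2} = \sqrt{11 - 9} = \sqrt{2}$$

Hence

$$\rho_{XY} = r_{XY}\ (\text{rank}) = \frac{53 - 5(3)(3)}{5\sqrt{2}\,\sqrt{2}} = 0.80$$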

Thus  we  see  that  the  so-called  rank  difference  method  is  merely 
the  cross-product  correlation  between  the  rank  numbers  of  the 
variates.  As  might  be  suspected,  frequently  certain  complications 
arise  to  interrupt  the  apparently  simple  ranking  of  the  values. 
Generally  there  are  several  scores  of  the  same  size,  or  there  exist 
ties  in  the  ranks.  In  such  cases  it  is  customary  to  give  each  the 
mean  of  the  ranks  of  the  positions  that  they  occupy.  Thus,  suppose 
3  tied  for  fifth  place.  Had  there  been  no  ties,  the  ranks  would  have 
been  5,  6,  7.  We  arbitrarily  assign  to  each  place  the  rank  number  6, 
which  is  the  mean  of  5,  6,  and  7.  If  2  scores  tied  for  the  eighth 
place,  we  would  assign  each  the  rank  number  8.5. 
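
The tie rule just described can be sketched as a small routine. This is an illustrative implementation, not the text's own procedure; it ranks in ascending order (smallest value gets rank 1, as in this section) and gives tied values the mean of the positions they occupy. The sample scores are hypothetical.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values equal to the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions are 1-based: the run occupies places pos+1 through end+1.
        mean_rank = ((pos + 1) + (end + 1)) / 2.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


# Hypothetical scores: three values tie for fifth place, so each receives rank 6.
print(average_ranks([10, 20, 30, 40, 50, 50, 50, 80]))
```

With two values tied for eighth place the same routine assigns 8.5 to each, matching the rule stated above.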

We shall now proceed to develop a formula for finding the rank
coefficient $\rho_{XY}$.

Evidently the X-values are the numbers 1, 2, 3, . . ., n, and the
Y-values are the same numbers but probably arranged in a different
order. Hence

$$\Sigma X = \Sigma Y = 1 + 2 + 3 + \cdots + n = \frac{n(n + 1)}{2}$$

$$M_X = M_Y = \frac{1}{n} \cdot \frac{n(n + 1)}{2} = \frac{n + 1}{2}$$

Also

$$\Sigma X^2 = \Sigma Y^2 = 1^2 + 2^2 + 3^2 + \cdots + n^2 = \frac{n(n + 1)(2n + 1)}{6}$$

Hence, using formula (7) page 128,

$$\sigma_X = \sigma_Y = \sqrt{\frac{(n + 1)(2n + 1)}{6} - \frac{(n + 1)^2}{4}} = \sqrt{\frac{n^2 - 1}{12}}$$

From

$$\Sigma (X - Y)^2 = \Sigma X^2 - 2\,\Sigma XY + \Sigma Y^2$$

we obtain, substituting values for $\Sigma X^2$ and $\Sigma Y^2$ above,

$$\Sigma XY = \frac{n(n + 1)(2n + 1)}{6} - \frac{\Sigma (X - Y)^2}{2}$$

Now, substituting in (7') page 245, the values found, we obtain
after simplifying

$$\rho_{XY} = r_{XY}\ (\text{rank}) = 1 - \frac{6\,\Sigma (X - Y)^2}{n(n^2 - 1)} \qquad (20)$$

Thus, after our data are ranked the computation of $\rho_{XY}$ is decidedly
simple. To illustrate the use of formula (20) let us return
to the height-weight data. We have the following table with headings
suitable to the use of formula (20).

Rank in Height and Weight of Five Boys

Boy    Rank in Height    Rank in Weight    X - Y    (X - Y)²
             X                 Y
 A           1                 2            -1         1
 B           2                 1             1         1
 C           3                 3             0         0
 D           4                 5            -1         1
 E           5                 4             1         1

$$n = 5 \qquad \Sigma (X - Y)^2 = 4$$

$$\rho_{XY} = 1 - \frac{6(4)}{5(25 - 1)} = 1 - \frac{24}{5(24)} = \frac{4}{5} = 0.80$$
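
The agreement between the two routes to the rank coefficient can be checked numerically. The sketch below is illustrative only (the function names are assumptions, not the text's notation): it computes the cross-product coefficient on the ranks of the five boys and the rank-difference formula (20), and also the cross-product coefficient on the original heights and weights of Table 60.

```python
import math


def cross_product_r(xs, ys):
    """Cross-product correlation from deviations about the means."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rank_difference_rho(rank_x, rank_y):
    """Formula (20): 1 - 6*sum((X - Y)^2) / (n(n^2 - 1)); assumes no tied ranks."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


heights = [60, 62, 63, 65, 68]        # boys A to E, inches
weights = [137, 132, 148, 157, 153]   # boys A to E, pounds
rank_h = [1, 2, 3, 4, 5]
rank_w = [2, 1, 3, 5, 4]

print(round(cross_product_r(heights, weights), 2))   # coefficient on actual values
print(round(cross_product_r(rank_h, rank_w), 2))     # coefficient on ranks
print(round(rank_difference_rho(rank_h, rank_w), 2)) # formula (20)
```

The two rank-based results coincide because formula (20) is the cross-product coefficient specialized to the integers 1 through n; when ties are present the two can differ slightly.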

EXERCISES 

1. Ten examination papers in algebra were read by two judges and
ranked according to merit. The following table shows the results of the
rankings. Find $\rho_{XY}$.

Examination    Rank by Judge No. 1    Rank by Judge No. 2
Paper                   X                      Y
   1                    6                      5
   2                    2                      1
   3                    4                      6
   4                    8                      9
   5                    1                      2
   6                    3                      3
   7                    7                      7
   8                    5                      4
   9                   10                      8
  10                    9                     10

2. The following table gives the ranks of 10 salesmen by the sales
manager of a corporation and also the ranks of the 10 salesmen on a
psychological test. Find $\rho_{XY}$.

Salesman     Rank by Sales Manager    Rank on Test
Jones                  1                    1
Smith                  2                    3
Brown                  3                    2
Kelly                  4                    6
Sanders                5                    7
Benson                 6                    4
Owens                  7                    8
Miller                 8                    5
Borden                 9                    9
Peterson              10                   10

3. From the following table, by the method of ranks find the correlation
between the grades in Test I and Test II; between the grades in Test I
and Test III; between the grades in Test II and Test III.

Grades of 21 Students in Three Tests in Integral Calculus


Student 

Test  I 



Test  II 



Test  III 

Student 

Test  I 

Test  II 

Test  III 

1 

on 

45 

00 

1  o 

yy 

o7 

Oft 

yy 

o 

bU 

OU 

on 

OKJ 

/U 

/  Z 

Q 
O 

Q1 

oi 

yo 

1  A 
14: 

yo 

QQ 
OO 

yj 

4 

93 

85 

90 

15 

97 

95 

93 

5 

87 

80 

70 

16 

34 

55 

30 

6 

95 

90 

100 

17 

96 

96 

96 

7 

74 

60 

60 

18 

74 

20 

40 

8 

61 

79 

85 

19 

62 

72 

75 

9 

92 

82 

71 

20 

63 

94 

91 

10 

67 

84 

97 

21 

88 

78 

94 

11 

100 

86 

98 

69.   CORRELATION  AND  CAUSATION 

The correlation coefficient, as we have used the term, is a mathematical
expression which measures the mathematical relationship (based upon
linear regression, or the best-fitting straight line to the data) that
exists between two variables X and Y. It must not be supposed that a low
coefficient of correlation proves a lack of relationship between the two
variables. Consider the data of Table 61. We note that for these data:

$$M_X = 0$$

and

$$\Sigma XY = 0$$

Hence by equation (7):

$$r = 0$$

That is, based upon the best-fitting straight line the data show a very
poor relationship or a straight line of very poor fit.

But based upon the semicircle, $Y = +\sqrt{25 - X^2}$, we have perfect
correlation, since each point is on the curve. This simple illustration
emphasizes a fact that we should keep in mind, namely, that the
Bravais-Pearson cross-product formula is based upon straight-line
regression.
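
A brief numerical check of this illustration is sketched below, using the seven (X, Y) pairs of Table 61. The function name and the raw-sums form of the coefficient are illustrative choices, not the text's notation; the raw-sums form is algebraically the same as the cross-product formula.

```python
import math


def r_from_sums(xs, ys):
    """Cross-product correlation written in terms of raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))


# The seven points of Table 61.
xs = [0, 3, 4, 5, -3, -4, -5]
ys = [5, 4, 3, 0, 4, 3, 0]

r = r_from_sums(xs, ys)
on_semicircle = all(x * x + y * y == 25 and y >= 0 for x, y in zip(xs, ys))

print(r)              # linear correlation of the seven points
print(on_semicircle)  # whether every point satisfies Y = +sqrt(25 - X^2)
```

The linear coefficient vanishes even though every point lies exactly on the curve, which is the sense in which r measures only straight-line relationship.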


Table 61

            X      Y      XY
            0      5       0
            3      4      12
            4      3      12
            5      0       0
           -3      4     -12
           -4      3     -12
           -5      0       0
Totals      0     19       0

It should also not be supposed that the existence of a high coefficient
of correlation between two variables proves any necessary and inherent
causal relationship between the two, that is, that one is the absolute
cause of the other. Consider the following table:


Table 62 *

Year       X       Y          X²          Y²          XY
1870      38      30       1,444         900       1,140
1875      55      38       3,025       1,444       2,090
1880      56      51       3,136       2,601       2,856
1885      73      69       5,329       4,761       5,037
1890      92      97       8,464       9,409       8,924
1895     114     114      12,996      12,996      12,996
1900     138     135      19,044      18,225      18,630
1905     177     169      31,329      28,561      29,913
1910     254     205      64,516      42,025      52,070
Total    997     908     149,283     120,922     133,656

Applying  formula  (7)  we  obtain 

r  =  0.98 

which  is  so  astoundingly  large  that  we  are  tempted  to  believe  that 
we  have  a  direct  and  dependent  cause-and-effect  relationship.  As  a 
matter  of  fact 

X = the total salaries paid school superintendents and teachers in
millions of dollars

and

Y = the total consumption of wines and liquors in the United
States in ten million gallons

for the given years.

This illustration shows almost perfect correlation, yet no one
believes that the consumption of wines and liquors increased necessarily
because teachers' salaries were increasing, nor that teachers'
salaries were increasing necessarily because more wines and liquors
were being consumed.
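
The value r = 0.98 can be reproduced from the X and Y columns of Table 62. The sketch below is illustrative (the function name is an assumption); it uses the raw-sums form of the cross-product coefficient, which needs only the column totals that the table already supplies.

```python
import math


def r_from_sums(xs, ys):
    """Cross-product correlation written in terms of raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))


# Table 62, 1870 to 1910 at five-year intervals.
salaries = [38, 55, 56, 73, 92, 114, 138, 177, 254]   # X, millions of dollars
liquors = [30, 38, 51, 69, 97, 114, 135, 169, 205]    # Y, ten million gallons

# Column totals, for comparison with the "Total" row of the table.
print(sum(salaries), sum(liquors))
print(sum(x * x for x in salaries), sum(y * y for y in liquors))
print(sum(x * y for x, y in zip(salaries, liquors)))

print(round(r_from_sums(salaries, liquors), 2))
```

Both series rise steadily over the period, so any two such series would show a coefficient near 1; the computation by itself says nothing about why they rise together.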

* The data are from Statistical Abstract of the United States, 1918, pp. 830
and 835.

A high coefficient of correlation proves a close linear mathematical
relationship between the two variables. It proves nothing more. It
suggests the probability of a cause-and-effect relationship between
the two variables, but the investigator must search further for the
explanation. Measurement of correlation is one part of the problem;
interpretation of the results is a more difficult part of the problem.¹
Before the subject of statistical analysis had reached its present
development, John Stuart Mill stated in his Logic:

Whatever phenomenon varies in any manner whenever another phenomenon
varies in some particular manner, is either a cause or an effect
of that phenomenon, or is connected with it through some fact of causation.²

The suggestion in the last clause of Mill's statement may assist
us in explaining the above paradoxical relationship between teachers'
salaries and the consumption of wines and liquors. The period from
1870 to 1910 was one of rapid development in the United States.
Population increased rapidly; foreign and domestic commerce,
agriculture, and the manufacturing industries grew by leaps and
bounds. The total amounts paid for the salaries of school superintendents
and school-teachers and the total amount of wines and
liquors consumed merely kept step with the development in other
lines. As a matter of fact, we are not at all astonished that the two
do show a surprisingly large coefficient. We term such correlation
"spurious."

In  the  interpretation  of  the  coefficient  of  correlation  it  is  better 
not  to  consider  it  as  a  measure  of  causal  dependence  but  rather  to 
consider  it  as  a  mathematical  expression  for  the  degree  of  association 
between the factors. In this regard Professor Chaddock says:

Therefore,  we  no  longer  search  for  cause  and  effect  relations  as  fixed  and 
unvarying  laws.  Association  or  correlation  between  occurrences  tends  to 
replace  the  older  idea  of  causation  in  scientific  investigation.  We  have 
seen  that  variation  is  a  universal  characteristic  of  phenomena.  We  can 
secure  relative  likeness  in  phenomena  by  a  process  of  classification  which 
places  similar  things  together  and  disregards  minor  variations.  The  problem 
of  science  is  to  find  out  how  the  variation  in  one  group  of  facts  is  associated 
with  or  contingent  upon  the  variation  in  other  groups,  and  to  measure  the 
degree  of  the  association. 

The aim is to find the series of facts which are most closely correlated in
order to enable the investigator to predict future experience. Causation
becomes a descriptive concept reached by statistical processes applied to the facts
of experience.¹

¹ See Rietz and others, op. cit., p. 138.

² Book III, Chap. VIII, Sect. 6. (Italics my own.)

As a final word we wish to reemphasize that the preceding chapter,
Linear Trends, and this chapter, Simple Correlation, have been
concerned with the problem of expressing the relationship between
the sets of data by means of linear regression. We have assumed Y
to be a linear function of a single independent variable X, or X to
be a linear function of Y. The close restrictions imposed necessarily
limit the range of application of the method. As an illustration, in
considering the problem of July rainfall in Ohio and its effect upon
the yield of corn in that state (Exercise 3, page 243), the thoughtful
student must have wondered about the effect of other natural
causes, such as the rainfall for May, the rainfall for June, and the
temperatures for May, June, July, and August. And well he may wonder.
The yield of corn may be considered as a function (or effect) of the
several variables (or causes) mentioned. A study of problems of this
character in which the dependent variable is a linear function of
several independent variables belongs to the subject of multiple
correlation, whereas problems in which the dependent variable is a linear
function of a single independent variable belong to the subject of simple
correlation. The subject of multiple correlation is treated in Chapter 9.
If the reader desires he may, without loss of continuity, begin
its study now; or he may defer it.

Further,  we  may  consider  that  the  relationship  between  the  de- 
pendent variable  and  the  single  independent  variable  can  be  described 
by  some  simple  curve  other  than  a  straight  line.  Such  correlation 
based  upon  curvilinear  regression  will  be  considered  in  Chapter  10. 

EXERCISES 

1.  For  the  Water  Depth-Alfalfa  Yield  data  of  Exercise  2,  page  243,  the 
following  is  a  summary: 

$M_X = 33.75$ inches       $M_Y = 7.25$ tons       $m = 0.075$
$\sigma_X = 14.98$ inches    $\sigma_Y = 1.26$ tons    $r = 0.89$

(1)  Find  the  equation  of  the  regression  line  of  Y  on  X. 

(2)  Is  the  value  of  r  sufficiently  large  to  warrant  confidence  in  the  re- 
gression line  for  purposes  of  estimation? 

(3)  Find Y in (1) if X = 40.

(4)  Find  Sy  and  interpret  your  result  for  the  value  found  in  (3). 

1 R. E. Chaddock, Principles and Methods in Statistics, 1925, p. 250.

2. Find the correlation of the yield of a plant of oats with the number of
kernels per plant for the data of the accompanying table.

X = the number of kernels per plant.    Y = the yield in grams.

Kernels  per  Plant  ^ 


>< 

25 

75 

125 

175 

225 

275 

325 

375 

425 

475 

8.5 

1 

7.5 

1 

1 

6.5 

3 

4 

5.5 

12 

26 

4 

4.5 

4 

30 

39 

7 

3.5 

4 

47 

51 

7 

2.5 

61 

45 

1.5 

20 

30 

0.5 

2 

1 



3.  The  following  table  is  a  correlation  table  for  the  lengths  and  the 
breadths  of  60  leaves.   X  =  breadths  and  Y  =  lengths,  in  millimeters.^ 

Breadths 


16 

19 

22 

25 

28 

31 

34 

52 

1 

1 

47 

2 

3 

1 

1 

42 

1 

3 

5 

3 

37 

2 

5 

4 

3 

32 

1 

3 

4 

3 

2 

27 

1 

3 

3 

1 

22 

1 

2 

1 

Find  r  and  the  regression  lines  for  the  data. 

4.  Find  r  for  the  Savings  Bank  Deposits-Strikes  and  Lockouts  data  of 
page  236.  Is  this  value  of  r  sufficiently  large  to  warrant  your  using  with 
confidence  the  regression  equations  for  purposes  of  estimation? 

5.  As in Exercise 4 above, treat the Value of Crops-Value of Land
(Illinois)  data  of  page  236. 

6.  Similarly,  treat  the  Value  of  Crops-Value  of  Land  (Iowa)  data  of 
page  237. 

1 The data are from A. S. Gale and C. W. Watkeys, Elementary Functions and
Applications, 1920, p. 432.

* Gavett, First Course in Statistical Method, p. 234.

7. The following correlation table gives the scores of 104 freshmen at
Georgetown College. X = scores in mathematics. Y = scores in intelligence.

Scores  in  Intelligence  and  Mathematics  Tests  of  104  Students 

Mathematics 


2.5 

7.5 

12.5 

17.5 

22.5 

27.5 

32.5 

37.5 

42.5 

47.5 

52.5 

57.5 

145 

1 

1 

1 

1 

135 

1 

2 

2 

1 

1 

1 

125 

1 

5 

2 

3 

4 

1 

1 

1 

115 

1 

2 

1 

5 

1 

3 

2 

2 

1 

1 

105 

3 

1 

4 

1 

4 

2 

4 

95 

4 

2 

2 

4 

2 

3 

1 

85 

1 

3 

3 

2 

1 

75 

1 

2 

1 

1 

1 

65 

2 



Find  r  and  the  regression  lines  for  the  data. 

8.  In  an  investigation  of  the  resemblance  of  fathers  and  sons  with 
respect  to  stature,  the  following  summary  was  obtained: 


Stature of fathers, X:    $M_X = 67.7$ inches    $\sigma_X = 3.21$ inches
Stature of sons, Y:       $M_Y = 68.7$ inches    $\sigma_Y = 2.71$ inches

$r = 0.51$


What  is  the  most  probable  height  of  the  sons  of  a  group  of  selected 
fathers  whose  mean  height  is  6  feet?  Discuss  the  reliability  of  this  estimate 
by  means  of  Sy. 

9.  Are  the  following  correlations  positive  or  negative? 

(1)  The  speed  of  an  auto  and  the  distance  required  to  bring  the  car 
to  rest  when  the  brakes  are  applied. 

(2)  Age  of  applicants  for  life  insurance  and  cost  of  insurance. 

(3)  Age  of  an  automobile  and  its  trade-in  value. 

(4)  Family  income  and  cost  of  the  family  car. 

(5)  Marriage  rate  and  index  of  unemployment. 

(6)  Age  and  blood  pressure. 

(7)  Age  of  husbands  and  age  of  wives. 

(8)  Index  of  unemployment  and  amount  of  goods  purchased. 

(9)  The soot content in the air at Pittsburgh and the production of
pig iron.

(10)  Total production of wheat and the average farm price per bushel.

(11)  Per  cent  illiteracy  and  the  per  cent  foreign  population  in  the 
counties  in  Pennsylvania. 

(12)  Crime,  as  measured  by  the  number  of  indictable  offences  tried, 
and  the  index  of  unemployment. 

(13)  Value  of  crops  per  acre  and  the  value  of  land  per  acre  in  Illinois. 

(14)  Amount  of  savings  deposits  and  the  number  of  strikes  and  lock-outs 
in  the  United  States. 

(15)  Number  of  hogs  slaughtered  per  month  at  Chicago  and  monthly 
price  of  pork  at  Chicago. 

(16)  Marriage  rate  and  the  index  of  industrial  activity.  (See  Groves 
and  Ogburn:  American  Marriage  and  Family  Relationships,  Chap- 
ter XVIII.) 

(17)  Scholarship  and  success  in  life.  (See  Gifford:  Does  Business 
Want  Scholars?      Harpers,  May,  1928.) 

10.  In  the  following  table  (F.  C.  Mills:  Statistical  Methods,  p.  381) 

X  =  Federal  Reserve  Banks'  Discount  Rates  (per  cent). 
Y  =  Commercial  Banks'  Discount  Rates  (per  cent). 


Federal  Reserve  Banks'  Discount  Rates  (per  cent) 



\.  X 
Y 

4.00 

4.50 

5.00 

5.50 

6.00 

6.50 

7.00 

8.00 

1 

1 

2 

7.50 

7 

9 

1 

7.00 

5 

4 

63 

9 

36 

6.50 

2 

9 

10 

22 

1 

3 

6.00 

1 

90 

29 

6 

30 

5.50 

11 

110 

5 

5.00 

10 

24 

4.50 

2 

1 

(1)  Choose (h, k) at (5.50, 6.50), compute r and the regression of Y on X.

(2)  Find the estimated value of Y if X = 5.00.

(3)  Compute the arithmetic mean of the Y-array for X = 5.00 and
compare with the value found in (2).

(4)  Find the regression equation of X on Y.

(5)  Find the estimated value of X if Y = 7.00.

(6)  Find the arithmetic mean of the X-array for Y = 7.00 and compare
with the value found in (4).

(7)  Find $S_Y$ and $S_X$ for the estimated values in (2) and (5) and interpret
them.

11. The data in the following table were taken from the Handbook of
Labor Statistics, 1936 Edition, pages 132 and 673.

X  =  Index  of  Wholesale  Prices  in  the  United  States.    (U.S.  Dept.  of 

Labor.  Monthly  Average,  1926  =  100.) 
Y  =  General  Index  of  Employment.    (U.S.  Dept.  of  Labor.  3-year 

average,  1923-1925  =  100.) 
Compute  r. 


Year     X      Y        Year     X      Y        Year     X      Y
1919    139    107       1925    104     99       1931     73     77
1920    154    108       1926    100    101       1932     65     64
1921     98     82       1927     95     99       1933     66     69
1922     97     91       1928     97     99       1934     75     79
1923    101    104       1929     95    105       1935     80     82
1924     98     97       1930     86     92

12. The following data are taken from the Yearbook of Agriculture,
1935, pp. 363-364.

X = supply of wheat in the U.S., July 1.
Y = price of wheat at Chicago.

Year    Supply           Price          Year    Supply           Price
        (million bu.)    (cents)                (million bu.)    (cents)
             X              Y                        X              Y
1919         77            227          1927        122            138
1920        145            216          1928        124            117
1921        126            128          1929        247            130
1922        114            113          1930        303             84
1923        137            106          1931        326             53
1924        144            139          1932        385             53
1925        115            161          1933        393             94
1926        105            140

(1)  Draw a chart for these data similar to Chart 6, p. 48.

(2)  Compute r and interpret it.

(3)  Compute m and interpret it.

(4)  Write the equation of the regression line of Y on X.

(5)  Find the estimated values of Y if X = 100, 200, and 300.

(6)  Find $S_Y$ of the estimates, and interpret.

13. The following table gives the average number of kernels per culm
per oat plant and the average height of the oat plants (Love-Leighty).
Find r.

Number of Kernels


\  X 

35 

45 

55 

66 

75 

85 

95 

105 

115 

125 

1  — 

87.5 



3 

2 

2 

7 

1 
1 

1  9 

OA 
ZD 

Zo 

/O 

77.5 

2 

16 

40 

38 

23 

3 

122 

72.5 

1 

13 

30 

59 

32 

5 

140 

67.5 

7 

22 

9 

6 

1 

45 

62.5 

4 

7 

11 

57.5 

1 

1 

2 

fix) 

1 

4 

16 

37 

56 

117 

97 

54 

14 

4 

400 

14.  In  the  following  table: 

X  =  price  per  bushel  in  cents  received  by  producers  December  1  for 
corn 

Y  =  price  per  bushel  in  cents  received  by  producers  December  1  for 
wheat 


Find  r  and  discuss  its  significance.  Would  you  say  this  correlation  is 
spurious? 


Price of Corn and Price of Wheat in the United States,¹ 1909-1928

Year      X        Y        Year      X        Y
1909     58.6     98.4      1919    134.5    214.9
1910     48.0     88.3      1920     67.0    143.7
1911     61.8     87.4      1921     42.3     92.6
1912     48.7     76.0      1922     65.8    100.7
1913     69.1     79.9      1923     72.6     92.3
1914     64.4     98.6      1924     98.2    129.9
1915     57.5     91.9      1925     67.4    141.6
1916     88.9    160.3      1926     64.2    119.8
1917    127.9    200.8      1927     72.3    111.5
1918    136.5    204.2      1928     75.1     97.2

15. If X = Income in dollars per capita in Texas in 1932,

Y = Retail sales in dollars per capita in Texas in 1932,
r = 0.875,    and    m = 0.746,

(1)  Comment on the estimative value of the line of regression
Y = 0.746X + 8.33.

(2)  If X = $175, compute Y, and compare with the observed value,
$133.

(3)  If X increases $1.00, what is the expected change in Y?
16.  In  the  following  table: 

X = scores of 32 students on the Bucknell test in intermediate algebra.
Y  =  scores  of  the  same  students  on  a  standardized  test  in  intermediate 
algebra. 

Z  =  the  semester  grades  of  the  same  students  in  intermediate  algebra. 


  X    Y    Z        X    Y    Z        X    Y    Z        X    Y    Z
 54   56   67       90   94   91       27   46   35       88   95   90
 55   64   67       63   79   77       78   54   76       72   70   82
 64   67   74       43   56   60       10   19   20       55   59   61
 33   43   48       69   48   70       49   39   60       61   68   76
 57   55   60       47   48   52       46   58   50       33   52   50
 42   59   60       62   59   67       70   41   62       65   45   65
 88   84   81       92   64   90       45   37   50       84   82   88
 85   84   86       75   68   85       95   99   92       55   60   52

Verify  the  following  analysis: 

$M_X = 61$          $M_Y = 61$          $M_Z = 67$

$\sigma_X = 20.4$     $\sigma_Y = 17.9$     $\sigma_Z = 17.0$

$r_{XY} = 0.78$      $r_{XZ} = 0.94$      $r_{YZ} = 0.84$

Which  test  was  given  the  greater  weight  in  the  determination  of  the  stu- 
dents' semester  grades? 

17. The following data are taken from the 1935 World Almanac, pp. 479,
499.

Column II gives the average attendance (in thousands) in New York
City schools for the given years.

Column III gives the number (in thousands) arraigned before the
Magistrates Courts in New York City in the same years.

Find $\rho_{XY}$ or $r_{XY}$ (rank) for these data. Would you say that this
correlation is spurious? Explain.

Year    Col. II    Col. III        Year    Col. II    Col. III
1918      700        202           1923      853        420
1919      712        282           1924      870        455
1920      736        355           1925      891        440
1921      779        367           1926      910        437
1922      814        434           1927      926        527

Chapter  9 


MULTIPLE  CORRELATION 

70.   PRELIMINARY  EXPLANATION 

Our  previous  work  in  correlation  has  been  concerned  with  problems 
involving only two variables, an independent variable X and a dependent
variable Y. Such correlation is called "bivariate." It is
obvious  that  many  types  of  phenomena  are  affected  by  more  than 
one  factor  and  that  the  variations  in  the  dependent  variable  may 
be  due  to  the  interaction  of  many  forces. 

In bivariate correlation we measure the relationship between the
dependent variable Y and a single independent variable X, completely
ignoring the influence upon Y of other forces that may be
just as potent as X. Thus, on page 243 we measured the influence
of July rainfall X upon the production of corn in Ohio Y. We found r
to be 0.61, which shows that July rainfall does exert a significant
influence upon the production of corn. But we may wonder if it
exerts a greater influence than June rainfall or June temperature or
July temperature. We are thus aware that the production of corn
may be dependent upon several variables, and a consideration of the
production in this regard would present a problem in multiple
correlation. Multiple correlation is then concerned with the combined
influence of several independent variables upon a single dependent variable.

As another illustration, suppose we have the scores made by a
group of students on objective tests in English, Mathematics, and
Intelligence. By means of simple correlation we can measure the
relationship between the scores in Intelligence and those in
Mathematics, between the scores in Intelligence and those in English, and
between the scores in English and the scores in Mathematics. What
we now need is a method of combining two factors, say English and
Mathematics, in order that an estimate may be made of their
influence in combination upon the third factor, Intelligence.

The method of procedure by which this may be accomplished is
similar to that used in simple correlation.
71. THE CASE OF THREE VARIABLES

Let us assume that X1, X2, X3 are three variable quantities which
represent three interacting forces. Any one variable may be
considered mathematically a function of the other two. As in the case
of bivariate correlation, we shall assume that the relationships are
linear, that is, that the N observed points representing the observed
sets of data are distributed about the plane

$$X_1 = b_{12}X_2 + b_{13}X_3 + c \tag{1}$$

in which X2 and X3 are independent variables and X1 is the
dependent variable.¹

We shall determine the constants in accordance with the Principle
of Least Squares: The plane best fitting a set of points is that one
in which the constants are so determined that the sum of the squares
of the X1-residuals is a minimum.

An X1-residual is defined by the equation

$$p = X_1 - (b_{12}X_2 + b_{13}X_3 + c) \tag{2}$$

We shall determine b12, b13, and c so that

$$\Sigma p^2 = \Sigma[X_1 - (b_{12}X_2 + b_{13}X_3 + c)]^2 \tag{3}$$

shall be a minimum. The conditions for this are that the first partial
derivatives of Σp² with respect to c, b12, and b13 shall be equal to zero.
Equating to zero these derivatives, we obtain the normal equations

$$\begin{aligned}
b_{12}\Sigma X_2 + b_{13}\Sigma X_3 + Nc &= \Sigma X_1 \\
b_{12}\Sigma X_2^2 + b_{13}\Sigma X_2X_3 + c\,\Sigma X_2 &= \Sigma X_1X_2 \\
b_{12}\Sigma X_2X_3 + b_{13}\Sigma X_3^2 + c\,\Sigma X_3 &= \Sigma X_1X_3
\end{aligned} \tag{4}$$

from which, by simultaneous solution, the values of b12, b13, and c may
be determined in terms of the observed values X1, X2, X3.
Thus, suppose we wish to find the plane

$$X_1 = b_{12}X_2 + b_{13}X_3 + c$$

that best fits the ten points (X1, X2, X3) given in Table 63.

¹ The first subscript affixed to the regression coefficient bij will be the
subscript of the letter X on the left (the dependent variable), and the second will
be the subscript of the X to which it is attached.
Table 63

 X3    X2    X1    X3²    X2²     X1²    X1X2    X1X3   X2X3

  2     2    11      4      4     121      22      22      4
  3     4    17      9     16     289      68      51     12
  4     6    26     16     36     676     156     104     24
  5     5    28     25     25     784     140     140     25
  6     8    31     36     64     961     248     186     48
  7     7    35     49     49   1,225     245     245     49
  9    10    41     81    100   1,681     410     369     90
 10    11    49    100    121   2,401     539     490    110
 11    13    63    121    169   3,969     819     693    143
 13    14    69    169    196   4,761     966     897    182

 70    80   370    610    780  16,868   3,613   3,197    687

M3 = 7      M2 = 8      M1 = 37

We complete the table to find the Σ functions that we need in
the normal equations (4). Substituting in (4) we obtain

$$\begin{aligned}
80b_{12} + 70b_{13} + 10c &= 370 \\
780b_{12} + 687b_{13} + 80c &= 3613 \\
687b_{12} + 610b_{13} + 70c &= 3197
\end{aligned}$$

To solve these equations we divide each equation by the coefficient
of b12 of that equation. We obtain

$$\begin{aligned}
b_{12} + .875b_{13} + .125c &= 4.625 \\
b_{12} + .881b_{13} + .103c &= 4.632 \\
b_{12} + .888b_{13} + .102c &= 4.654
\end{aligned}$$

Next we subtract the first equation from the second and the
second equation from the third. We obtain

$$\begin{aligned}
.006b_{13} - .022c &= .007 \\
.007b_{13} - .001c &= .022
\end{aligned}$$

or, multiplying by 1,000

$$\begin{aligned}
6b_{13} - 22c &= 7 \\
7b_{13} - c &= 22
\end{aligned}$$

Solving these equations and substituting we obtain b12 = 1.735,
b13 = 3.223, c = 0.561. The equation of the best fitting plane is

$$X_1 = 1.735X_2 + 3.223X_3 + 0.561$$

Exercise. Show that the point (M1, M2, M3) = (37, 8, 7) is on this plane.
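The hand solution above rounds each quotient to three decimals before subtracting. As a cross-check, the following minimal sketch solves the normal equations (4) exactly with rational arithmetic and Cramer's rule. The names `DATA`, `det3`, and `fit_plane` are illustrative and not from the text. Under these assumptions the exact coefficients come out near 1.894, 3.054, and 0.471, somewhat different from the rounded values 1.735, 3.223, and 0.561, apparently because the three normal equations are nearly proportional and the solution is therefore sensitive to rounding.

```python
from fractions import Fraction

# Table 63 data as (X1, X2, X3) triples.
DATA = [(11, 2, 2), (17, 4, 3), (26, 6, 4), (28, 5, 5), (31, 8, 6),
        (35, 7, 7), (41, 10, 9), (49, 11, 10), (63, 13, 11), (69, 14, 13)]


def det3(m):
    """Third-order determinant expanded along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_plane(data):
    """Solve the normal equations (4) for (b12, b13, c) by Cramer's rule."""
    n = len(data)
    s1 = sum(x1 for x1, _, _ in data)
    s2 = sum(x2 for _, x2, _ in data)
    s3 = sum(x3 for _, _, x3 in data)
    s22 = sum(x2 * x2 for _, x2, _ in data)
    s33 = sum(x3 * x3 for _, _, x3 in data)
    s23 = sum(x2 * x3 for _, x2, x3 in data)
    s12 = sum(x1 * x2 for x1, x2, _ in data)
    s13 = sum(x1 * x3 for x1, _, x3 in data)
    # Rows follow equations (4); columns are the coefficients of b12, b13, c.
    coeffs = [[s2, s3, n], [s22, s23, s2], [s23, s33, s3]]
    rhs = [s1, s12, s13]
    d = det3(coeffs)
    solution = []
    for col in range(3):
        m = [row[:] for row in coeffs]
        for r in range(3):
            m[r][col] = rhs[r]  # replace one column by the known terms
        solution.append(Fraction(det3(m), d))
    return tuple(solution)


b12, b13, c = fit_plane(DATA)
print(float(b12), float(b13), float(c))
```

Both the rounded plane and the exact one pass (to rounding) through the point of means (37, 8, 7), so the exercise above holds in either case.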
We may test the goodness-of-fit of the plane by finding the
computed values of X1 for the given values of X2 and X3, and the
X1-residuals. The computed values of X1, the (X1-residuals), and the
(X1-residuals)² are shown in Table 64.


Table 64

 X3    X2    X1    Computed X1    X1-residuals p    (X1-residuals)² p²

  2     2    11      10.477          + 0.523               .274
  3     4    17      17.170          - 0.170               .029
  4     6    26      23.863          + 2.137              4.567
  5     5    28      25.351          + 2.649              7.017
  6     8    31      33.779          - 2.779              7.723
  7     7    35      35.267          - 0.267               .071
  9    10    41      46.918          - 5.918             35.023
 10    11    49      51.876          - 2.876              8.271
 11    13    63      58.569          + 4.431             19.634
 13    14    69      66.750          + 2.250              5.062

                                     - 0.020             87.671 = Σp²

We note that five points are above the plane and five points are
below it, and that the sum of the residuals is essentially zero. The
sum of the squares of the X1-residuals, Σp², plays a role in multiple
correlation similar to that played by Σp² in simple correlation.
(See p. 233.) It assists us in finding the standard error of estimate,
S1(23). As we did in simple correlation, we define the standard error
of estimate by the equation ¹

$$S_{1(23)} = \sqrt{\frac{\Sigma p^2}{N}}$$

This is a quantity which, when combined with the computed value of
X1, makes possible our measuring the confidence or the reliability
we may place in values of X1 estimated from the equation for given
values of X2 and X3. Thus, the odds are 2 to 1 that, for given
values of X2 and X3, the observed X1 will lie within the interval

$$(\text{computed } X_1) \pm S_{1(23)}$$

¹ The subscript before the parenthesis designates the variable estimated (the
dependent variable) and the subscripts within the parentheses designate the
variables from which the estimate has been made.
Similarly, the odds are 19 to 1 that the observed X1 will lie within

$$(\text{computed } X_1) \pm 2S_{1(23)}$$

and 385 to 1 that the observed X1 will lie within

$$(\text{computed } X_1) \pm 3S_{1(23)}$$

For the problem we are considering

$$S_{1(23)} = \sqrt{\frac{87.671}{10}} = 2.961$$

It will be noted that only two of the ten points have residuals
numerically larger than 2.961, and that no point has a residual
numerically larger than 2(2.961) = 5.922.
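The residual columns of Table 64 and the standard error of estimate can be recomputed directly from the definition. The sketch below assumes the plane found in the text (b12 = 1.735, b13 = 3.223, c = 0.561); the names `DATA`, `residuals`, and `s_est` are illustrative only. With Σp² = 87.671 and N = 10, the definition gives a standard error of about 2.961.

```python
import math

# Table 63 data as (X1, X2, X3) triples.
DATA = [(11, 2, 2), (17, 4, 3), (26, 6, 4), (28, 5, 5), (31, 8, 6),
        (35, 7, 7), (41, 10, 9), (49, 11, 10), (63, 13, 11), (69, 14, 13)]
B12, B13, C = 1.735, 3.223, 0.561  # plane found in the text


def residuals(data, b12, b13, c):
    """Observed X1 minus the X1 computed from the plane, for each point."""
    return [x1 - (b12 * x2 + b13 * x3 + c) for x1, x2, x3 in data]


p = residuals(DATA, B12, B13, C)
sum_sq = sum(r * r for r in p)
s_est = math.sqrt(sum_sq / len(DATA))           # standard error of estimate
outside_1s = sum(1 for r in p if abs(r) > s_est)
outside_2s = sum(1 for r in p if abs(r) > 2 * s_est)
print(round(sum(p), 3), round(sum_sq, 3), round(s_est, 3), outside_1s, outside_2s)
```

Counting the points outside one and two standard errors in this way gives a quick check on the odds quoted above for a particular set of data.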

In a later section we will discuss the coefficient of multiple correlation
which is an expression that measures the degree of the relation
between a single dependent variable, say X1, and several
independent variables, X2 and X3, in combination. We shall show that
this coefficient R1(23) may be found from the formula

$$R_{1(23)} = \sqrt{1 - \frac{S_{1(23)}^2}{\sigma_1^2}}$$

where σ1 means σX1.

From Table 63 we find

$$\sigma_1 = \sqrt{\frac{\Sigma X_1^2}{N} - M_1^2} = \sqrt{1686.8 - 37^2} = 17.83$$

Hence

$$R_{1(23)} = \sqrt{1 - \frac{S_{1(23)}^2}{\sigma_1^2}} = \sqrt{1 - .0275} = \sqrt{.9725} = 0.986$$


Thus,  we  have  completed  the  analysis  of  the  data  of  Table  63. 
This  analysis  has  included  finding  (1)  the  best  fitting  plane,  (2)  the 
standard  error  of  estimate,  and  (3)  the  coefficient  of  multiple  corre- 
lation between  Xi  and  (X2  and  X3)  in  combination. 
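The last two steps of the analysis use only three totals: ΣX1 and ΣX1² from Table 63 and Σp² from Table 64. A minimal sketch of that arithmetic follows; the variable names are illustrative and not from the text.

```python
import math

N = 10
SUM_X1, SUM_X1_SQ = 370, 16868   # column totals of Table 63
SUM_P_SQ = 87.671                # total of the squared residuals, Table 64

mean_1 = SUM_X1 / N
sigma_1 = math.sqrt(SUM_X1_SQ / N - mean_1 ** 2)   # standard deviation of X1
s_est_sq = SUM_P_SQ / N                            # square of the standard error of estimate
r_multiple = math.sqrt(1 - s_est_sq / sigma_1 ** 2)

print(round(sigma_1, 2), round(r_multiple, 3))
```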


EXERCISES 

1. For the values of b12, b13, and c determined by (4), show that

(a) the algebraical sum of the X1-residuals is equal to zero, and that

(b) the point (M1, M2, M3) is on (1).

Note. The quantities M1, M2, and M3 are the means of the variables
X1, X2, and X3 respectively.


2. 


Y. 

10 

2 

5 

4 

7 
1 

17 

6 

8 

19 

8 

9 

25 

9 

12 

22 

10 

10 

26 

11 

13 

31 

12 

15 

30 

13 

14 

35 

15 

17 

(1) Find the regression equation for these data
with X1 as dependent upon X2 and X3.

(2) Find the computed values of X1 for the given
values of X2 and X3.

(3) Find the X1-residuals.

(4) Find S1(23) and R1(23).

(5) How many of the points are within (X1
computed) ± S1(23)?


72.  THE  CASE  OF  THREE  VARIABLES  CONTINUED 

Secondary  Explanation 

The method employed in the preceding section is satisfactory when
the number, N, of sets of values is small, say less than forty. When
N is large, as it usually is, we need a more systematic procedure.
Further, the development of a theory in terms of the original variates,
X1, X2, and X3 is rather complex and tedious.

A simpler and more elegant procedure is to show that the centroidal
point (M1, M2, M3) is on the best-fitting plane, then to transform our
variates to this centroidal point as origin. (We shall indicate the
means of the variables X1, X2, and X3 by M1, M2, and M3 respectively,
and their standard deviations by σ1, σ2, and σ3.) We shall prove
that (M1, M2, M3) satisfies equation (1) for the values of b12, b13,
and c determined by equations (4).

If the first of equations (4) be divided by N we have

$$b_{12}\frac{\Sigma X_2}{N} + b_{13}\frac{\Sigma X_3}{N} + c = \frac{\Sigma X_1}{N}$$

or

$$b_{12}M_2 + b_{13}M_3 + c - M_1 = 0$$

which is the condition that (M1, M2, M3) is on (1).

We now translate our data to the centroidal point as origin and
take the equation of the plane through this point to be

$$x_1 = b_{12}x_2 + b_{13}x_3$$

where

$$x_1 = X_1 - M_1, \qquad x_2 = X_2 - M_2, \qquad x_3 = X_3 - M_3$$

For this form of the regression plane any x1-residual is given by

$$p = x_1 - (b_{12}x_2 + b_{13}x_3)$$

and by equating to zero the first partial derivatives of

$$\Sigma p^2 = \Sigma[x_1 - (b_{12}x_2 + b_{13}x_3)]^2 \tag{5}$$

with respect to b12 and b13, we obtain the normal equations

$$\begin{aligned}
b_{12}\Sigma x_2^2 + b_{13}\Sigma x_2x_3 &= \Sigma x_1x_2 \\
b_{12}\Sigma x_2x_3 + b_{13}\Sigma x_3^2 &= \Sigma x_1x_3
\end{aligned} \tag{6}$$


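Let σj be the standard deviation of the N values of Xj, and let
rpq be the correlation coefficient of the N given pairs of values of
Xp and Xq. Thus Σx1² = Nσ1², Σx2² = Nσ2², Σx1x2 = Nσ1σ2r12,
Σx1x3 = Nσ1σ3r13, and so on.

By expressing the summations in terms of the standard deviations
and correlation coefficients, the normal equations (6) after
simplification become

$$\begin{aligned}
b_{12}\sigma_2 + b_{13}\sigma_3r_{23} &= \sigma_1r_{12} \\
b_{12}\sigma_2r_{23} + b_{13}\sigma_3 &= \sigma_1r_{13}
\end{aligned} \tag{7}$$

Solving the normal equations (7) we obtain the regression
coefficients

$$b_{12} = \frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2}\cdot\frac{\sigma_1}{\sigma_2}, \qquad
b_{13} = \frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2}\cdot\frac{\sigma_1}{\sigma_3} \tag{8}$$

and the regression plane is thus

$$\frac{x_1}{\sigma_1}(1 - r_{23}^2) = \frac{x_2}{\sigma_2}(r_{12} - r_{13}r_{23}) + \frac{x_3}{\sigma_3}(r_{13} - r_{12}r_{23}) \tag{9}$$

In terms of the original variates X1, X2, X3 the equation of the
regression plane is

$$\frac{X_1 - M_1}{\sigma_1}(1 - r_{23}^2) = \frac{X_2 - M_2}{\sigma_2}(r_{12} - r_{13}r_{23}) + \frac{X_3 - M_3}{\sigma_3}(r_{13} - r_{12}r_{23}) \tag{10}$$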
Equation (10) gives the most probable value of X1 for assigned
values of X2 and X3. Analogous equations may be written with X2
and X3 as the dependent variables by cyclically
permuting the subscripts 1, 2, and 3; that is,
replacing 1 by 2, 2 by 3, and 3 by 1, as if one were
going around the circle in the direction indicated
by the figure.

We have thus reached a result which gives an effective summary
of the manner in which X2 and X3 in combination affect X1. Further,
it is delightful to observe that this summarizing equation involves
nothing more complicated than simple correlation coefficients.
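To see that the regression plane really does depend only on means, standard deviations, and simple correlation coefficients, the following sketch computes those quantities for the data of Table 63 and then applies equations (8). The helper names `moments` and `regression_coefficients` are illustrative and not from the text; standard deviations are taken with divisor N, as in the text.

```python
import math

# Table 63 data as (X1, X2, X3) triples.
DATA = [(11, 2, 2), (17, 4, 3), (26, 6, 4), (28, 5, 5), (31, 8, 6),
        (35, 7, 7), (41, 10, 9), (49, 11, 10), (63, 13, 11), (69, 14, 13)]


def moments(data):
    """Return the means, standard deviations, and (r12, r13, r23)."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(col) / n for col in cols]
    sigmas = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
              for col, m in zip(cols, means)]

    def r(p, q):
        cov = sum((a - means[p]) * (b - means[q])
                  for a, b in zip(cols[p], cols[q])) / n
        return cov / (sigmas[p] * sigmas[q])

    return means, sigmas, (r(0, 1), r(0, 2), r(1, 2))


def regression_coefficients(sigmas, r12, r13, r23):
    """Equations (8): b12 and b13 from correlations and standard deviations."""
    s1, s2, s3 = sigmas
    b12 = (r12 - r13 * r23) / (1 - r23 ** 2) * s1 / s2
    b13 = (r13 - r12 * r23) / (1 - r23 ** 2) * s1 / s3
    return b12, b13


means, sigmas, (r12, r13, r23) = moments(DATA)
b12, b13 = regression_coefficients(sigmas, r12, r13, r23)
c = means[0] - b12 * means[1] - b13 * means[2]  # plane passes through the means
print(round(b12, 3), round(b13, 3), round(c, 3))
```

The same two functions can be applied to summary constants alone, which is the situation in exercises that give only means, standard deviations, and correlations.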


EXERCISES

1.

 X1    X2    X3

  2    26     1
  4    20     2
  6    20     3
  9    17     4
  5     7     5
  5     5     6
 11     3     7

(1) Verify the following:

M1 = 6          M2 = 14         M3 = 4
σ1 = 2.828      σ2 = 8.246      σ3 = 2
r12 = -0.551    r13 = 0.707     r23 = -0.970

(2) Find the regression plane with X1 as dependent
on X2 and X3.

(3) Find R1(23) and S1(23).


2.

 X1    X2    X3

  5     4     5
  4     5     2
  5     6     4
  6     4     9
  9     5     8
 10     6     4
  9     6    10
 12     7    11
 11     9    10
  9     8     7

(1) Verify the following:

M1 = 8          M2 = 6          M3 = 7
σ1 = 2.646      σ2 = 1.549      σ3 = 2.933
r12 = .683      r13 = .696      r23 = .374

(2) Find the regression plane with X1 as dependent
on X2 and X3.

(3) Find the computed values of X1 and the
X1-residuals.

(4) Find R1(23) and S1(23).

(5) How many of the points are within (X1
computed) ± S1(23)?


3. In the following table

X1 = the semester grades of 32 students in intermediate algebra
X2 = the scores of the same students on a standardized test in
intermediate algebra
X3 = the scores of the same students on the Bucknell test in intermediate
algebra

X3   X2   X1      X3   X2   X1      X3   X2   X1      X3   X2   X1

64 

56 

67 

90 

94 

91 

27 

46 

35 

88 

95 

90 

55 

64 

67 

63 

79 

77 

78 

54 

76 

72 

70 

82 

fi7 

74. 

*±o 

ou 

1 Q 

00 

Dv 

Oi 

33 

43 

48 

69 

48 

70 

49 

39 

60 

61 

68 

76 

57 

55 

60 

47 

48 

52 

46 

58 

50 

33 

52 

50 

42 

59 

60 

62 

59

67 

70 

41 

62 

65 

45 

65 

88 

84 

81 

92 

64 

90 

45 

37 

50 

84 

82 

88 

85 

84 

86 

75 

68 

85 

95 

99 

92 

55 

60 

52 

(1) Verify the following values:

M1 = 67       M2 = 61       M3 = 61
σ1 = 17.0     σ2 = 17.9     σ3 = 20.4
r12 = 0.84    r13 = 0.94    r23 = 0.78

(2) Find R1(23) and S1(23).

(3) Find the equation of the regression plane with X1 dependent upon
X2 and X3. Show that the point (M1, M2, M3) is on this plane.

(4) What meaning do you attach to the values of b12 and b13?

(5) Estimate X1 for X2 = 84 and X3 = 81. Use your value of S1(23)
to interpret this estimate.

4.  The  following  table  gives  a  summary  of  the  fundamental  statistical 
constants  that  were  obtained  from  scores  made  on  objective  tests  in 
English,  Mathematics,  and  Intelligence  by  343  Bucknell  freshmen. 

(1)  Find  the  equation  for  the  regression  of  Intelligence  on  English  and 
Mathematics. 

(2)  What  is  the  estimated  Intelligence  score  for  an  individual  whose 
English  score  was  172  and  whose  Mathematics  score  was  40? 

(3)  What  is  the  estimated  Mathematics  score  of  an  individual  whose 
English  score  was  160  and  whose  Intelligence  score  was  150? 


Fundamental Constants from 343 Scores in the Tests Given
in English, Mathematics, and Intelligence

                         English    Mathematics    Intelligence

Correlations
   English                 1.00        0.30           0.65
   Mathematics             0.30        1.00           0.46
   Intelligence            0.65        0.46           1.00

Arithmetic Means            151          34            140

Standard Deviations          44          12             45
73. COEFFICIENT OF MULTIPLE CORRELATION

Three Variables

It is evident that the value of equation (9) or (10) as a tool for
purposes of estimation depends upon the closeness of fit of the
plane to the points. As was suggested in the preceding section we
shall use for measuring the goodness of fit of the plane to the points
the standard error of estimate, S1(23),

$$S_{1(23)} = \sqrt{\frac{\Sigma p^2}{N}}$$

where Σp² is determined from the values of b12 and b13 in (7) or (8).
From (5)

$$\begin{aligned}
\Sigma p^2 &= \Sigma[x_1 - (b_{12}x_2 + b_{13}x_3)]^2 \\
&= \Sigma x_1^2 + b_{12}^2\Sigma x_2^2 + b_{13}^2\Sigma x_3^2 - 2b_{12}\Sigma x_1x_2 - 2b_{13}\Sigma x_1x_3 + 2b_{12}b_{13}\Sigma x_2x_3
\end{aligned}$$

which may be written in the form

$$\Sigma p^2 = N[\sigma_1^2 + b_{12}^2\sigma_2^2 + b_{13}^2\sigma_3^2 - 2b_{12}\sigma_1\sigma_2r_{12} - 2b_{13}\sigma_1\sigma_3r_{13} + 2b_{12}b_{13}\sigma_2\sigma_3r_{23}] \tag{11}$$

We desire the value of Σp² for the values of b12 and b13 given by (7)
or (8). This may be easily found by multiplying the normal
equations (7) by b12σ2 and b13σ3 respectively, adding the results, and
substituting the results in (11). The value for Σp² then becomes

$$\Sigma p^2 = N[\sigma_1^2 - b_{12}\sigma_1\sigma_2r_{12} - b_{13}\sigma_1\sigma_3r_{13}] \tag{12}$$

If now the values of b12 and b13 given by (8) are substituted, we have

$$S_{1(23)}^2 = \sigma_1^2\left[1 - \frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2}\right] \tag{13}$$

or

$$S_{1(23)} = \sigma_1\sqrt{1 - R_{1(23)}^2} \tag{14}$$

where

$$R_{1(23)} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2}} \tag{15}$$

is the "coefficient of multiple correlation" of X1 on X2 and X3.

By permuting the subscripts we may write down the values of
R2(13) and R3(12). Due to the fact that we have no mathematical
method of attaching a meaning to the algebraical sign of R1(23), it
is customary to write it without sign.
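Equations (14) and (15) can be evaluated from the column totals of Table 63 alone. The sketch below is one way to do so; the names `covariance`, `r_multiple`, and `s_estimate` are illustrative. Because it uses unrounded correlations, it corresponds to the exact least-squares plane, and under that assumption it gives R1(23) of about 0.986 and S1(23) of about 2.959.

```python
import math

N = 10
# Column totals of Table 63.
S1, S2, S3 = 370, 80, 70
S11, S22, S33 = 16868, 780, 610
S12, S13, S23 = 3613, 3197, 687


def covariance(sum_pq, sum_p, sum_q, n=N):
    """Mean product of deviations, computed from raw totals."""
    return sum_pq / n - (sum_p / n) * (sum_q / n)


sigma_1 = math.sqrt(covariance(S11, S1, S1))
sigma_2 = math.sqrt(covariance(S22, S2, S2))
sigma_3 = math.sqrt(covariance(S33, S3, S3))
r12 = covariance(S12, S1, S2) / (sigma_1 * sigma_2)
r13 = covariance(S13, S1, S3) / (sigma_1 * sigma_3)
r23 = covariance(S23, S2, S3) / (sigma_2 * sigma_3)

# Equation (15), then equation (14).
r_multiple = math.sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))
s_estimate = sigma_1 * math.sqrt(1 - r_multiple ** 2)

print(round(r_multiple, 3), round(s_estimate, 3))
```

The three correlations here are all close to unity, so the numerator and denominator of (15) are both small differences; carrying full precision through the calculation matters in such cases.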
From (13) we may note that since S²1(23) is a positive quantity,
0 ≤ R²1(23) ≤ 1. When R1(23) is equal to unity numerically, that is,
when

$$r_{12}^2 + r_{13}^2 + r_{23}^2 - 2r_{12}r_{13}r_{23} = 1$$

we have perfect multiple correlation. In this case the points are
on the regression plane.

The coefficient of multiple correlation is an expression which
measures the degree of relationship between a single dependent
variable and a number of independent variables in combination.
It is more accurately defined as the ordinary cross-product coefficient
of correlation between the X1 estimated by (10) and the observed
X1, or between x1 estimated by (9) and the observed x1. (See
Exercise 1 of the next list of exercises.)
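The second definition can be checked numerically. The following sketch fits the plane to the deviations of the Table 63 data by solving the normal equations (6), forms the estimated deviations, and compares the ordinary correlation between observed and estimated values with the ratio of their standard deviations. The names `centered`, `pearson`, and `x1e` are illustrative and not from the text.

```python
import math

# Table 63 data as (X1, X2, X3) triples.
DATA = [(11, 2, 2), (17, 4, 3), (26, 6, 4), (28, 5, 5), (31, 8, 6),
        (35, 7, 7), (41, 10, 9), (49, 11, 10), (63, 13, 11), (69, 14, 13)]


def centered(data):
    """Deviations (x1, x2, x3) of each point from the means."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [tuple(v - m for v, m in zip(row, means)) for row in data]


def pearson(u, v):
    """Cross-product correlation of two lists of deviations (each with mean zero)."""
    return sum(a * b for a, b in zip(u, v)) / math.sqrt(
        sum(a * a for a in u) * sum(b * b for b in v))


rows = centered(DATA)
s22 = sum(x2 * x2 for _, x2, _ in rows)
s33 = sum(x3 * x3 for _, _, x3 in rows)
s23 = sum(x2 * x3 for _, x2, x3 in rows)
s12 = sum(x1 * x2 for x1, x2, _ in rows)
s13 = sum(x1 * x3 for x1, _, x3 in rows)

# Solve the two normal equations (6) for b12 and b13.
det = s22 * s33 - s23 ** 2
b12 = (s12 * s33 - s13 * s23) / det
b13 = (s13 * s22 - s12 * s23) / det

x1 = [row[0] for row in rows]
x1e = [b12 * row[1] + b13 * row[2] for row in rows]   # estimated deviations

r_obs_est = pearson(x1, x1e)
sigma_ratio = math.sqrt(sum(e * e for e in x1e) / sum(a * a for a in x1))
print(round(r_obs_est, 4), round(sigma_ratio, 4))
```

For these data the two printed numbers agree, which is the content of parts (c), (e), and (f) of Exercise 1 below; the sketch illustrates the identity but does not prove it.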


EXERCISES 

1. (a) The estimated value of x1, say x1e, may be found from

$$x_{1e} = b_{12}x_2 + b_{13}x_3$$

where b12 and b13 are given by (7) or (8). Show that the standard
deviation σ1e of x1e is given by

$$\sigma_{1e}^2 = b_{12}\sigma_1\sigma_2r_{12} + b_{13}\sigma_1\sigma_3r_{13}$$

Hint: Use $\sigma_{1e}^2 = \Sigma x_{1e}^2 / N$ and equations (7).

(b) Show that $S_{1(23)}^2 = \sigma_1^2 - \sigma_{1e}^2$.
Hint: Use equation (12) and (a).

(c) Show that $R_{1(23)} = \sigma_{1e}/\sigma_1$.

(d) Show that $\Sigma x_1x_{1e} = N\sigma_{1e}^2$.

Hint: Multiply the value of x1e in (a) by x1, and sum. Change the
Σ quantities on the right-hand side into statistical symbols.

(e) Show that $r_{1\,1e} = \sigma_{1e}/\sigma_1$.

(f) Show that $R_{1(23)} = r_{1\,1e}$.

2. Show that

$$R_{1(23)}^2 = \frac{b_{12}\Sigma x_1x_2 + b_{13}\Sigma x_1x_3}{\Sigma x_1^2}$$

3. Show that

$$R_{1(23)}^2 = \frac{1}{\sigma_1}\left[r_{12}b_{12}\sigma_2 + r_{13}b_{13}\sigma_3\right]$$

4. Show that, for the least-squares plane, the algebraical sum of the
residuals is zero.

5.  State  three  important  properties  of  the  least-squares  plane  fitting  a 
set  of  N  points. 

6. (Davies and Crowder.) The following table gives the rankings of the
specified states in 1860.

You can save labor by using the values of ΣX and ΣX² given on pages 9
and 10, or by using (20) page 265.

X1 = rank of the specified state in notables
X2 = rank of the specified state in education
X3 = rank of the specified state in capital

State            X1   X2   X3      State             X1   X2   X3

Alabama          24   23   24      Mississippi       28   17   27
Arkansas         29   27   29      Missouri          19   18   19
Connecticut       2    2    3      New Hampshire      5    5    7
Delaware          8   19    8      New Jersey         9   13    4
Florida          27   29   28      New York           7    7    6
Georgia          25   25   23      North Carolina    22   28   22
Illinois         14   14   16      Ohio              10   11   10
Indiana          17   16   14      Pennsylvania      12   10    5
Iowa             16   12   26      Rhode Island       4    8    1
Kentucky         20   22   15      South Carolina    21   21   21
Louisiana        26   20   25      Tennessee         23   24   18
Maine             6    4   12      Vermont            3    3   11
Maryland         13   15    9      Virginia          18   26   13
Massachusetts     1    1    2      Wisconsin         15    6   20
Michigan         11    9   17

(1) Verify the values:

M1 = 15         M2 = 15         M3 = 15
σ1 = 8.367      σ2 = 8.367      σ3 = 8.367
r12 = 0.867     r13 = 0.886     r23 = 0.670

(2) Find R1(23) and S1(23).


74.  DETERMINANTS 

A.  Determinants  of  the  Second  Order. 

If we solve the equations

$$\begin{aligned}
a_1x_1 + b_1y_1 &= c_1 \\
a_2x_1 + b_2y_1 &= c_2
\end{aligned}$$

simultaneously, we obtain the solutions:

$$x_1 = \frac{c_1b_2 - c_2b_1}{a_1b_2 - a_2b_1}, \qquad y_1 = \frac{a_1c_2 - a_2c_1}{a_1b_2 - a_2b_1}$$

By adopting the shorthand notation

$$\begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} = a_1b_2 - a_2b_1, \qquad
\begin{vmatrix} c_1 & b_1 \\ c_2 & b_2 \end{vmatrix} = c_1b_2 - c_2b_1, \quad \text{etc.}$$

we may write the solutions

$$x_1 = \frac{\begin{vmatrix} c_1 & b_1 \\ c_2 & b_2 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix}}, \qquad
y_1 = \frac{\begin{vmatrix} a_1 & c_1 \\ a_2 & c_2 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix}}$$

The square arrays defined above are called determinants. Since
there are two rows and two columns, we call the arrays determinants
of the second order. The letters a1, a2, b1, b2, etc. are called the elements
of the determinant. The elements a1, b2 constitute the "principal
diagonal" of the determinant found in the denominators of x1 and y1.

We note that the denominators of x1 and y1 are the same
determinant, that formed from the coefficients as they stand in the given
equations. Further, we note that the numerator for x1 may be
obtained from the denominator by replacing a1, a2, which are
coefficients of x1 in the given equations, by the terms c1, c2. Similarly,
the numerator for y1 is the determinant of the denominator with
b1, b2 replaced by c1, c2 respectively. The determinant of the
denominator is called the determinant of the system.


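Example. Solve by determinants:

$$\begin{aligned}
x + y &= 3 \\
2x + 3y &= 1
\end{aligned}$$

Solution:

$$x = \frac{\begin{vmatrix} 3 & 1 \\ 1 & 3 \end{vmatrix}}{\begin{vmatrix} 1 & 1 \\ 2 & 3 \end{vmatrix}} = \frac{9 - 1}{3 - 2} = 8, \qquad
y = \frac{\begin{vmatrix} 1 & 3 \\ 2 & 1 \end{vmatrix}}{\begin{vmatrix} 1 & 1 \\ 2 & 3 \end{vmatrix}} = \frac{1 - 6}{3 - 2} = -5$$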
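The rule just illustrated translates directly into a few lines of code. This is a minimal sketch with illustrative function names (`det2`, `cramer2`), assuming integer coefficients so that exact fractions can be used.

```python
from fractions import Fraction


def det2(a1, b1, a2, b2):
    """Second-order determinant with rows (a1, b1) and (a2, b2)."""
    return a1 * b2 - a2 * b1


def cramer2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1, a2*x + b2*y = c2 by determinants."""
    d = det2(a1, b1, a2, b2)            # determinant of the system
    if d == 0:
        raise ValueError("determinant of the system is zero")
    x = Fraction(det2(c1, b1, c2, b2), d)   # first column replaced by c1, c2
    y = Fraction(det2(a1, c1, a2, c2), d)   # second column replaced by c1, c2
    return x, y


# The worked example: x + y = 3, 2x + 3y = 1.
x, y = cramer2(1, 1, 3, 2, 3, 1)
print(x, y)  # prints "8 -5"
```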
EXERCISES

Solve the following pairs of equations using determinants:

1. x + y = 2
   2x + 3y = 7

2. x - 3y = 6
   4x - 5y = 24

3. 0.3x + 0.2y = 4.0
   0.7x - 0.6y = 26.4

4. 4x - 3y = 17
   12x + 16y = -9


B. Determinants of the Third Order. The solution of three
equations in three unknowns is also facilitated by the use of
determinants. In this case we have square arrays of three rows and
three columns or determinants of the third order. The square array
in the left-hand member of the equality

$$D = \begin{vmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{vmatrix}
= a_1\begin{vmatrix} b_2 & c_2 \\ b_3 & c_3 \end{vmatrix}
- b_1\begin{vmatrix} a_2 & c_2 \\ a_3 & c_3 \end{vmatrix}
+ c_1\begin{vmatrix} a_2 & b_2 \\ a_3 & b_3 \end{vmatrix}$$

is a determinant of the third order. It is defined in terms of
determinants of the second order as in the right-hand member of the
above equality which is called the expansion of the determinant.
The elements a1, b2, c3 constitute the principal diagonal.

The second order determinants in the above equality are called
minors of the elements a1, b1, c1 respectively. The minor to a1 is
the determinant that remains after crossing out the row and the
column in which a1 lies. Similarly the minor for any other element
is found.

The above determinant was expanded according to the elements of
the first row. We may also expand it according to the elements of
the first column. Thus,

$$D = \begin{vmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{vmatrix}
= a_1\begin{vmatrix} b_2 & c_2 \\ b_3 & c_3 \end{vmatrix}
- a_2\begin{vmatrix} b_1 & c_1 \\ b_3 & c_3 \end{vmatrix}
+ a_3\begin{vmatrix} b_1 & c_1 \\ b_2 & c_2 \end{vmatrix}$$

It is obvious that the complete development of a determinant of
the third order has six terms. Thus,

$$D = a_1b_2c_3 + a_2b_3c_1 + a_3b_1c_2 - a_1b_3c_2 - a_3b_2c_1 - a_2b_1c_3$$
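The three forms of the development can be compared on a numerical example. In the sketch below the array `M` is a hypothetical example chosen for illustration and is not from the text; the function names are likewise illustrative.

```python
def det2(p, q, r, s):
    """Second-order determinant with rows (p, q) and (r, s)."""
    return p * s - q * r


def det3_first_row(m):
    """Expansion according to the elements of the first row."""
    (a1, b1, c1), (a2, b2, c2), (a3, b3, c3) = m
    return (a1 * det2(b2, c2, b3, c3)
            - b1 * det2(a2, c2, a3, c3)
            + c1 * det2(a2, b2, a3, b3))


def det3_first_column(m):
    """Expansion according to the elements of the first column."""
    (a1, b1, c1), (a2, b2, c2), (a3, b3, c3) = m
    return (a1 * det2(b2, c2, b3, c3)
            - a2 * det2(b1, c1, b3, c3)
            + a3 * det2(b1, c1, b2, c2))


def det3_six_terms(m):
    """Complete development in six terms."""
    (a1, b1, c1), (a2, b2, c2), (a3, b3, c3) = m
    return (a1 * b2 * c3 + a2 * b3 * c1 + a3 * b1 * c2
            - a1 * b3 * c2 - a3 * b2 * c1 - a2 * b1 * c3)


M = [[2, -1, 3], [0, 4, 1], [5, 2, -2]]  # hypothetical array
print(det3_first_row(M), det3_first_column(M), det3_six_terms(M))  # prints "-85 -85 -85"
```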
If we solve by elementary algebra the equations

$$\begin{aligned}
a_1x + b_1y + c_1z &= d_1 \\
a_2x + b_2y + c_2z &= d_2 \\
a_3x + b_3y + c_3z &= d_3
\end{aligned}$$

for x we obtain

$$x = \frac{d_1b_2c_3 + d_2b_3c_1 + d_3b_1c_2 - d_1b_3c_2 - d_3b_2c_1 - d_2b_1c_3}{a_1b_2c_3 + a_2b_3c_1 + a_3b_1c_2 - a_1b_3c_2 - a_3b_2c_1 - a_2b_1c_3}$$

The denominator is the development of the determinant D, above,
and the numerator is the same as the denominator with ai replaced
by di, i = 1, 2, 3. Hence we can write

$$x = \frac{\begin{vmatrix} d_1 & b_1 & c_1 \\ d_2 & b_2 & c_2 \\ d_3 & b_3 & c_3 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{vmatrix}}$$

In a similar way we can find y and z:

$$y = \frac{\begin{vmatrix} a_1 & d_1 & c_1 \\ a_2 & d_2 & c_2 \\ a_3 & d_3 & c_3 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{vmatrix}}, \qquad
z = \frac{\begin{vmatrix} a_1 & b_1 & d_1 \\ a_2 & b_2 & d_2 \\ a_3 & b_3 & d_3 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{vmatrix}}$$

We note that the denominators of x, y, and z are the same, the
determinant of the system. The determinant in the numerator of
any unknown can be obtained from the denominator by replacing
the column of the coefficients of this unknown by the corresponding
known terms, d1, d2, d3.

In the expansion of D we note that the sign preceding the minor
of a1 is plus, that preceding the minor of a2 is minus, that preceding
the minor of a3 is plus. The sign preceding a minor corresponding
to an element is easy to remember. Consider an element in the
h-row and k-column. If (h + k) is even the sign prefixed to the
minor is plus, and if (h + k) is odd the sign prefixed to the minor
is minus. The minor of an element with its sign attached is called
the co-factor of the element. We note that D is equal to the sum
of the products of the elements of any row (or column) and their
respective co-factors.
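The column-replacement rule and the sign rule for co-factors can be sketched together as follows. The system solved here is a hypothetical example, not one of the exercises, and the names `minor`, `det`, and `cramer` are illustrative; integer coefficients are assumed so that exact fractions can be used.

```python
from fractions import Fraction


def minor(m, h, k):
    """Array left after striking out row h and column k (counted from 0)."""
    return [row[:k] + row[k + 1:] for i, row in enumerate(m) if i != h]


def det(m):
    """Determinant by expansion along the first row, using the (h + k) sign rule."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** k * m[0][k] * det(minor(m, 0, k)) for k in range(len(m)))


def cramer(coeffs, rhs):
    """Solve a square system by replacing one column at a time with the known terms."""
    d = det(coeffs)
    if d == 0:
        raise ValueError("determinant of the system is zero")
    solution = []
    for col in range(len(coeffs)):
        replaced = [row[:col] + [rhs[i]] + row[col + 1:]
                    for i, row in enumerate(coeffs)]
        solution.append(Fraction(det(replaced), d))
    return solution


# Hypothetical system (not from the text):
#    x +  y + z = 6
#   2x -  y + z = 3
#    x + 2y - z = 2
x, y, z = cramer([[1, 1, 1], [2, -1, 1], [1, 2, -1]], [6, 3, 2])
print(x, y, z)  # prints "1 2 3"
```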


EXERCISES 


1. Evaluate the determinant

$$\begin{vmatrix} 1 & 3 & 4 \\ 2 & 7 & 3 \\ 1 & 3 & 5 \end{vmatrix}$$

by expanding (a) according to the elements in the first row, and
(b) according to the elements in the first column.

Solve for x, y, and z the equations:

2. x - y - z = -6
   2x + y + z = 0
   3x - 5y + 8z = 13

3. x + 2y - z = 6
   2x - y + 3z = -13
   3x - 2y + 3z = 16

C. Determinants of Any Order. We defined a determinant of
the third order in terms of the elements of a row (or column) and
their minors. Similarly we may define determinants of the fourth
and higher orders. Thus, the following determinant of the fourth
order

$$\begin{vmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \\ a_3 & b_3 & c_3 & d_3 \\ a_4 & b_4 & c_4 & d_4 \end{vmatrix}
= a_1\begin{vmatrix} b_2 & c_2 & d_2 \\ b_3 & c_3 & d_3 \\ b_4 & c_4 & d_4 \end{vmatrix}
- a_2\begin{vmatrix} b_1 & c_1 & d_1 \\ b_3 & c_3 & d_3 \\ b_4 & c_4 & d_4 \end{vmatrix}
+ a_3\begin{vmatrix} b_1 & c_1 & d_1 \\ b_2 & c_2 & d_2 \\ b_4 & c_4 & d_4 \end{vmatrix}
- a_4\begin{vmatrix} b_1 & c_1 & d_1 \\ b_2 & c_2 & d_2 \\ b_3 & c_3 & d_3 \end{vmatrix}$$

is defined in terms of the elements of the first column and their
minors. The sign preceding a minor of an element in h-row and
k-column is plus or minus according as (h + k) is even or odd. A
minor of an element with its sign attached is the co-factor of the
element. The value of a determinant is the sum of the products of
the elements of a row (or column) and their co-factors.

Just as we define determinants of the third and fourth orders in
terms of the elements of a row or column and their co-factors, so we
define a determinant of any order to be the sum of the products of the
elements of a row (or column) and their respective co-factors.
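This recursive definition is easy to express in code. The sketch below expands along the first column, as the text does for the fourth order, and then checks that expansion along any row gives the same value. The arrays `SAMPLE` and `DIAGONAL` are hypothetical examples and the function names are illustrative. Expansion by co-factors is convenient for small orders only, since the number of terms grows very rapidly with the order.

```python
def minor(m, h, k):
    """Array left after striking out row h and column k (counted from 0)."""
    return [row[:k] + row[k + 1:] for i, row in enumerate(m) if i != h]


def det(m):
    """Determinant of any order, expanded along the first column."""
    n = len(m)
    if n == 1:
        return m[0][0]
    # Counting rows from 0, the sign alternates +, -, +, ... down the column.
    return sum((-1) ** h * m[h][0] * det(minor(m, h, 0)) for h in range(n))


def expand_along_row(m, h):
    """Sum of the products of the elements of row h and their co-factors."""
    return sum((-1) ** (h + k) * m[h][k] * det(minor(m, h, k))
               for k in range(len(m)))


DIAGONAL = [[2, 0, 0, 0], [0, 3, 0, 0], [0, 0, 4, 0], [0, 0, 0, 5]]
SAMPLE = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 0, 1, 1], [1, 1, 0, 2]]  # hypothetical

print(det(DIAGONAL))  # prints 120, the product of the diagonal elements
print(all(expand_along_row(SAMPLE, h) == det(SAMPLE) for h in range(4)))
```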
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EXERCISES 


1.  Expand  the  following  determinants  (a)  according  to  the  elements 
of  the  first  row,  and  (b)  according  to  the  elements  of  the  first  column. 


(1) 


2 
1 

-  2 
2 


4 
2 
0 
3 


-23 
1  0 
~  1  3 
-23 


(2) 


-  2 
5 
4 
1 


1  3 
3  3 
0  2 

2  3 


0 
1 
4 

3 


2.  The  following  theorems  are  true  for  determinants  of  any  order 
We  ask  the  student  to  prove  them  for  determinants  of  the  third  order. 

(1)  If  the  corresponding  rows  and  columns  of  D  be  interchanged,  D  is 
unchanged  in  value. 

(2)  If  any  two  rows  (or  columns)  of  D  be  interchanged,  D  becomes  —  D. 

(3)  If  any  two  rows  (or  columns)  be  identical,  D  =  0. 

(4)  If  each  element  of  a  row  (or  column)  of  D  be  multiplied  by  A;,  the 
value  of  the  resulting  determinant  is  kD,  ' 

(5)  If  to  each  element  of  a  row  (or  column)  of  D  is  added  k  times  the 
corresponding  element  of  another  row  (or  column),  D  is  unchanged 
in  value. 


75.   APPLICATION  OF  DETERMINANTS 

Three  Variables 

The results of the analysis of the foregoing sections on multiple
correlation can be expressed in very simple forms by the use of
determinants.

Let

$$D = \begin{vmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{vmatrix}
= \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}$$

where rhk is the element in the h-row and the k-column. Evidently
rhh = rkk = 1, and rhk = rkh.

A minor Dhk of the element rhk is the determinant formed by the
elements that remain after striking out all the coefficients in the row
and the column common to rhk. Thus, for examples,

$$D_{11} = \begin{vmatrix} r_{22} & r_{23} \\ r_{32} & r_{33} \end{vmatrix} = 1 - r_{23}^2, \qquad
D_{12} = \begin{vmatrix} r_{21} & r_{23} \\ r_{31} & r_{33} \end{vmatrix} = r_{12} - r_{13}r_{23}, \qquad
D_{13} = \begin{vmatrix} r_{21} & r_{22} \\ r_{31} & r_{32} \end{vmatrix} = r_{12}r_{23} - r_{13}$$

A co-factor Ahk of the element rhk is the minor Dhk with the sign that
would be prefixed to it when the determinant D is expanded. The
sign that is prefixed to the minor is positive or negative according
as (h + k) is even or odd. That is

$$A_{hk} = (-1)^{h+k}D_{hk}$$

Expanding D according to the elements of the first row, we have

$$\begin{aligned}
D &= r_{11}D_{11} - r_{12}D_{12} + r_{13}D_{13} \\
&= r_{11}A_{11} + r_{12}A_{12} + r_{13}A_{13} \\
&= 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23}
\end{aligned} \tag{16}$$
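Now let us solve equations (7) by determinants and note the
simplicity of the results.

$$\begin{aligned} b_{12}\sigma_2 + b_{13}\sigma_3r_{23} &= \sigma_1r_{12} \\ b_{12}\sigma_2r_{23} + b_{13}\sigma_3 &= \sigma_1r_{13} \end{aligned} \tag{7}$$

We obtain

$$b_{12} = \frac{\begin{vmatrix} \sigma_1r_{12} & \sigma_3r_{23} \\ \sigma_1r_{13} & \sigma_3 \end{vmatrix}}{\begin{vmatrix} \sigma_2 & \sigma_3r_{23} \\ \sigma_2r_{23} & \sigma_3 \end{vmatrix}} = \frac{\sigma_1}{\sigma_2}\,\frac{\begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$$

$$b_{13} = \frac{\begin{vmatrix} \sigma_2 & \sigma_1r_{12} \\ \sigma_2r_{23} & \sigma_1r_{13} \end{vmatrix}}{\begin{vmatrix} \sigma_2 & \sigma_3r_{23} \\ \sigma_2r_{23} & \sigma_3 \end{vmatrix}} = \frac{\sigma_1}{\sigma_3}\,\frac{\begin{vmatrix} 1 & r_{12} \\ r_{23} & r_{13} \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$$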

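The regression coefficients in the determinant notation are

$$\begin{aligned} b_{12} &= \frac{D_{12}}{D_{11}}\frac{\sigma_1}{\sigma_2} = -\frac{A_{12}}{A_{11}}\frac{\sigma_1}{\sigma_2} \\ b_{13} &= -\frac{D_{13}}{D_{11}}\frac{\sigma_1}{\sigma_3} = -\frac{A_{13}}{A_{11}}\frac{\sigma_1}{\sigma_3} \end{aligned} \tag{17}$$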

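and the equations of the regression planes (9) and (10) are

$$\frac{x_1}{\sigma_1}A_{11} + \frac{x_2}{\sigma_2}A_{12} + \frac{x_3}{\sigma_3}A_{13} = 0$$

and

$$\frac{X_1 - M_1}{\sigma_1}A_{11} + \frac{X_2 - M_2}{\sigma_2}A_{12} + \frac{X_3 - M_3}{\sigma_3}A_{13} = 0 \tag{18}$$

or

$$\sum_{i=1}^{3} t_iA_{1i} = 0$$

where

$$t_i = \frac{x_i}{\sigma_i} = \frac{X_i - M_i}{\sigma_i}$$

Applying the determinant notation to equation (13) we get

$$S_{1(23)} = \sigma_1\sqrt{\frac{D}{D_{11}}} \tag{19}$$

which, when substituted in (14), leads to

$$R_{1(23)} = \sqrt{1 - \frac{D}{D_{11}}} \tag{20}$$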
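As a numerical check on (17), (19), and (20), the following sketch evaluates the determinant formulas and confirms that the resulting coefficients satisfy equations (7). The correlations and standard deviations are illustrative assumptions, not data from the text.

```python
import math

# Illustrative inputs (assumed values).
r12, r13, r23 = 0.5, 0.4, 0.2
s1, s2, s3 = 2.0, 4.0, 5.0          # sigma_1, sigma_2, sigma_3

# Minors of the first row and the major determinant, as in (16).
D11 = 1 - r23**2
D12 = r12 - r13 * r23
D13 = r12 * r23 - r13
D = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23

# Regression coefficients from (17).
b12 = (D12 / D11) * (s1 / s2)
b13 = -(D13 / D11) * (s1 / s3)

# Standard error of estimate (19) and multiple correlation (20).
S = s1 * math.sqrt(D / D11)
R = math.sqrt(1 - D / D11)

# Left-hand sides of equations (7); they should equal s1*r12 and s1*r13.
lhs_first = b12 * s2 + b13 * s3 * r23
lhs_second = b12 * s2 * r23 + b13 * s3
print(b12, b13, S, R)
```

With these inputs the two left-hand sides reproduce the right-hand sides of (7), so the determinant form and the direct solution agree.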

EXERCISES 

1.  Analyze  the  data  of  Exercise  3,  page  284,  using  determinants. 

2.  Analyze  the  data  of  Exercise  4,  page  285,  using  determinants. 

3.  Analyze  the  data  of  Exercise  6,  page  288,  using  determinants. 

76.  PARTIAL  CORRELATION 

Sometimes a correlation between two factors is due to the influence
of one or more other factors rather than to any inherent relationship
between the two themselves.¹ For this reason it is necessary to
eliminate as far as possible those uncontrolled factors which, through
their common relation to the variables to be correlated, tend to
influence unduly the true correlation. This is accomplished by a
technique known as partial correlation.

It is desirable, therefore, to obtain the correlation between $X_1$
and $X_2$, say, when $X_3$ has a fixed value. For example, we can find
the correlation between English and Mathematics (p. 285) when
Intelligence is constant, say 100, but not completely ignored.

In bivariate correlation, it will be recalled that the values of the
regression coefficients $b_{12}$ and $b_{21}$ of the regression equations

$$X_1 = b_{12}X_2 + c_1 \quad\text{and}\quad X_2 = b_{21}X_1 + c_2$$

were found to be ²

$$b_{12} = r_{12}\frac{\sigma_1}{\sigma_2} \quad\text{and}\quad b_{21} = r_{12}\frac{\sigma_2}{\sigma_1}$$

¹ See page 268. ² See pages 248 and 249.

The quantity $b_{12}$ measures the regression of $X_1$ on $X_2$ and $b_{21}$ measures
the regression of $X_2$ on $X_1$ when all other factors are ignored. We
also found that

$$r_{12}^2 = b_{12} \cdot b_{21} \tag{21}$$

Similarly, $b_{12.3}$ and $b_{21.3}$ measure the regression of $X_1$ on $X_2$ and
of $X_2$ on $X_1$ respectively in the equations ¹

$$X_1 = b_{12.3}X_2 + b_{13.2}X_3 + c_1 \quad\text{and}\quad X_2 = b_{21.3}X_1 + b_{23.1}X_3 + c_2$$

when $X_3$ is held constant but not ignored. Since the conditions leading
to equation (21) in bivariate correlation are exactly paralleled here,
we define the partial correlation coefficient $r_{12.3}$ between $X_1$ and $X_2$
for an assigned value of $X_3$ by the equation

$$r_{12.3}^2 = b_{12.3} \cdot b_{21.3} \quad\text{or}\quad r_{12.3} = \pm\sqrt{b_{12.3}\,b_{21.3}}$$

In terms of the constants previously determined in (17) we find

$$r_{12.3} = \pm\sqrt{\frac{D_{12}}{D_{11}}\frac{\sigma_1}{\sigma_2} \cdot \frac{D_{21}}{D_{22}}\frac{\sigma_2}{\sigma_1}} = \pm\frac{D_{12}}{\sqrt{D_{11}D_{22}}} = \pm\frac{A_{12}}{\sqrt{A_{11}A_{22}}} \tag{22}$$

since the major determinant is symmetrical about the principal
diagonal and hence $D_{12} = D_{21}$. The sign attached to $r_{12.3}$ is that of
$b_{12.3}$ or $b_{12}$.

It  is  noted  that  ri2.3  is  generally  unequal  to  ri2.  The  quantity  ri2 
measures  the  degree  of  correlation  between  Xi  and  X2  when  all 
other  factors  are  completely  ignored  whereas  ri2.3  measures  the  de- 
gree of  correlation  between  Xi  and  X2  when  X3  is  held  fixed  but  not 
ignored. The principal application of partial correlation is thus to
approximate what the correlation between two variables would
be if the influence of other variables were eliminated.

Professor Sorenson ² gives an interesting illustration that shows
the influence of the third variable on the correlation between the
other two variables. In his illustration

$X_1$ represents the carpal area of children
$X_2$ represents the mental age of children
$X_3$ represents the chronological age of children

¹ The subscripts following the point merely indicate the variables that are
held fixed in the development. They may frequently be omitted from the detail.

² Herbert Sorenson: Statistics for Students of Psychology and Education,
p. 252.

The following simple correlations were obtained.

$$r_{12} = 0.83 \qquad r_{13} = 0.92 \qquad r_{23} = 0.88$$

Naturally we are impressed by the apparently large correlation
between the skeletal development (carpal area) of children and
their mental age. When we "partial out" or remove the influence
of the third factor, chronological age, we find

$$r_{12.3} = \frac{D_{12}}{\sqrt{D_{11}D_{22}}} = \frac{0.0204}{\sqrt{(0.2256)(0.1536)}} = 0.11$$

which indicates very slight, if any, correlation.
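The arithmetic of this illustration is easy to reproduce. The sketch below is a minimal Python version of equation (22) for three variables; the function name `partial_r12_3` is our own, and the sign is taken from $D_{12}$, which has the sign of $b_{12.3}$.

```python
import math


def partial_r12_3(r12, r13, r23):
    """Partial correlation of X1 and X2 with X3 held fixed, from (22)."""
    D12 = r12 - r13 * r23      # minor of r12
    D11 = 1 - r23**2           # minor of r11
    D22 = 1 - r13**2           # minor of r22
    return D12 / math.sqrt(D11 * D22)


# Simple correlations quoted from Sorenson's illustration.
r = partial_r12_3(0.83, 0.92, 0.88)
print(round(r, 2))
```

The intermediate values are $D_{12} = 0.0204$, $D_{11} = 0.2256$, and $D_{22} = 0.1536$, as in the text.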

EXERCISES 

1. Express $r_{12.3}$ in terms of simple correlation coefficients.

2. Write down the values of $r_{13.2}$ and $r_{23.1}$.

3. Show that:

$$S_{1(23)} = \sigma_1\sqrt{(1 - r_{12}^2)(1 - r_{13.2}^2)} = \sigma_1\sqrt{(1 - r_{13}^2)(1 - r_{12.3}^2)}$$

4. By permuting the subscripts in number 3 preceding, write down the
values for $S_{2(13)}$ and $S_{3(12)}$.

5. In a certain study of a group of students' grades

$X_1$ denotes the percentage grades in mathematics
$X_2$ denotes the percentage grades in chemistry
$X_3$ denotes the percentage grades in history

M1 = 72     σ1 = 8      r12 = .6
M2 = 68     σ2 = 10     r13 = .4
M3 = 78     σ3 = 7      r23 = .3

What  is  the  probable  grade  in  chemistry  of  a  student  whose  grades  are: 
mathematics,  80%;  history,  70%? 

77.   THE  CASE  OF  FOUR  VARIABLES 

In  the  preceding  sections  we  have  considered  in  great  detail  the 
case  of  multiple  and  partial  correlation  based  upon  three  variables. 
We  shall  greatly  abbreviate  the  theory  for  the  case  of  four  variables 
leaving  the  details  to  be  supplied  by  the  reader. 

Assume that we have sets of data in the four variables $X_1$, $X_2$,
$X_3$, $X_4$ and that we wish to determine the regression coefficients
$b_{12}$, $b_{13}$, $b_{14}$, and the constant c so that $X_1$ computed from the
hyperplane

$$X_1 = b_{12}X_2 + b_{13}X_3 + b_{14}X_4 + c \tag{23}$$

may be the best estimate of $X_1$ for assigned values of $X_2$, $X_3$, $X_4$.
Adopting the least-squares criterion, we may determine the regression
coefficients so that

$$\Sigma\rho^2 = \Sigma[X_1 - (b_{12}X_2 + b_{13}X_3 + b_{14}X_4 + c)]^2 \tag{24}$$

shall be a minimum.

Equating to zero the first partial derivatives of $\Sigma\rho^2$ with respect to
c, $b_{12}$, $b_{13}$, $b_{14}$, we obtain the normal equations

$$\begin{aligned} b_{12}\Sigma X_2 + b_{13}\Sigma X_3 + b_{14}\Sigma X_4 + Nc &= \Sigma X_1 \\ b_{12}\Sigma X_2^2 + b_{13}\Sigma X_2X_3 + b_{14}\Sigma X_2X_4 + c\Sigma X_2 &= \Sigma X_1X_2 \\ b_{12}\Sigma X_2X_3 + b_{13}\Sigma X_3^2 + b_{14}\Sigma X_3X_4 + c\Sigma X_3 &= \Sigma X_1X_3 \\ b_{12}\Sigma X_2X_4 + b_{13}\Sigma X_3X_4 + b_{14}\Sigma X_4^2 + c\Sigma X_4 &= \Sigma X_1X_4 \end{aligned} \tag{25}$$

By dividing the first of equations (25) by N, we may show that
the hyperplane (23), for the values of $b_{12}$, $b_{13}$, $b_{14}$, c given by (25),
passes through the point $(M_1, M_2, M_3, M_4)$.

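Referring our data to this point as origin our regression equation
becomes

$$x_1 = b_{12}x_2 + b_{13}x_3 + b_{14}x_4 \tag{26}$$

where $x_i = X_i - M_i$, i = 1, 2, 3, 4.

That is, our regression equation is of the form (26) when the variables
are deviations from their respective means.

By minimizing the sum of the squares of the $x_1$-residuals,

$$\Sigma\rho^2 = \Sigma[x_1 - (b_{12}x_2 + b_{13}x_3 + b_{14}x_4)]^2$$

we obtain the normal equations

$$\begin{aligned} b_{12}\Sigma x_2^2 + b_{13}\Sigma x_2x_3 + b_{14}\Sigma x_2x_4 &= \Sigma x_1x_2 \\ b_{12}\Sigma x_2x_3 + b_{13}\Sigma x_3^2 + b_{14}\Sigma x_3x_4 &= \Sigma x_1x_3 \\ b_{12}\Sigma x_2x_4 + b_{13}\Sigma x_3x_4 + b_{14}\Sigma x_4^2 &= \Sigma x_1x_4 \end{aligned} \tag{27}$$

Expressing the summations in terms of standard deviations and
coefficients of correlation, equations (27) become

$$\begin{aligned} b_{12}\sigma_2 + b_{13}\sigma_3r_{23} + b_{14}\sigma_4r_{24} &= \sigma_1r_{12} \\ b_{12}\sigma_2r_{23} + b_{13}\sigma_3 + b_{14}\sigma_4r_{34} &= \sigma_1r_{13} \\ b_{12}\sigma_2r_{24} + b_{13}\sigma_3r_{34} + b_{14}\sigma_4 &= \sigma_1r_{14} \end{aligned} \tag{28}$$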

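Let D denote the major determinant:

$$D = \begin{vmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ r_{21} & r_{22} & r_{23} & r_{24} \\ r_{31} & r_{32} & r_{33} & r_{34} \\ r_{41} & r_{42} & r_{43} & r_{44} \end{vmatrix} = \begin{vmatrix} 1 & r_{12} & r_{13} & r_{14} \\ r_{21} & 1 & r_{23} & r_{24} \\ r_{31} & r_{32} & 1 & r_{34} \\ r_{41} & r_{42} & r_{43} & 1 \end{vmatrix}$$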

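Further, let $D_{hk}$ be the minor and $A_{hk}$ the co-factor of $r_{hk}$ so that
$A_{hk} = (-1)^{h+k}D_{hk}$. Then the solutions of (28) become

$$\begin{aligned} b_{12} &= \frac{\sigma_1}{\sigma_2}\frac{D_{12}}{D_{11}} = -\frac{\sigma_1}{\sigma_2}\frac{A_{12}}{A_{11}} \\ b_{13} &= -\frac{\sigma_1}{\sigma_3}\frac{D_{13}}{D_{11}} = -\frac{\sigma_1}{\sigma_3}\frac{A_{13}}{A_{11}} \\ b_{14} &= \frac{\sigma_1}{\sigma_4}\frac{D_{14}}{D_{11}} = -\frac{\sigma_1}{\sigma_4}\frac{A_{14}}{A_{11}} \end{aligned} \tag{29}$$

and the equations of the regression hyperplane become

$$\frac{x_1}{\sigma_1}A_{11} + \frac{x_2}{\sigma_2}A_{12} + \frac{x_3}{\sigma_3}A_{13} + \frac{x_4}{\sigma_4}A_{14} = 0 \tag{30}$$

and

$$\frac{X_1 - M_1}{\sigma_1}A_{11} + \frac{X_2 - M_2}{\sigma_2}A_{12} + \frac{X_3 - M_3}{\sigma_3}A_{13} + \frac{X_4 - M_4}{\sigma_4}A_{14} = 0 \tag{31}$$

expressed in terms of the deviations from their respective means
and the original variates respectively.

If the respective deviations from the means be expressed in units
of their standard deviations, that is, if

$$t_i = \frac{x_i}{\sigma_i} = \frac{X_i - M_i}{\sigma_i}, \quad i = 1, 2, 3, 4$$

equations (30) and (31) become

$$A_{11}t_1 + A_{12}t_2 + A_{13}t_3 + A_{14}t_4 = 0 \tag{32}$$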

or

$$\sum_{i=1}^{4} t_iA_{1i} = 0$$

Adopting as a measure of the accuracy of fit of (30), (31), or (32)
to the given observed values the quantity

$$S_{1(234)}^2 = \frac{\Sigma\rho^2}{N}$$

after some rather tedious algebraic operations we find

$$S_{1(234)}^2 = \sigma_1^2\,\frac{D}{D_{11}} \tag{33}$$

$$S_{1(234)} = \sigma_1\sqrt{\frac{D}{D_{11}}} \tag{34}$$

Equation (33) may also be written

$$S_{1(234)} = \sigma_1\sqrt{1 - R_{1(234)}^2} \tag{35}$$

where

$$R_{1(234)} = \sqrt{1 - \frac{D}{D_{11}}} \tag{36}$$

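Defining $r_{12.34}$, the partial coefficient of correlation between $X_1$
and $X_2$ when the variables $X_3$ and $X_4$ are held fixed, by the equation

$$r_{12.34} = \pm\sqrt{(b_{12.34})(b_{21.34})}$$

we immediately obtain

$$r_{12.34} = \pm\frac{D_{12}}{\sqrt{D_{11}D_{22}}} = \pm\frac{A_{12}}{\sqrt{A_{11}A_{22}}}$$

Similarly

$$r_{13.24} = \pm\frac{D_{13}}{\sqrt{D_{11}D_{33}}} = \pm\frac{A_{13}}{\sqrt{A_{11}A_{33}}}$$

$$r_{14.23} = \pm\frac{D_{14}}{\sqrt{D_{11}D_{44}}} = \pm\frac{A_{14}}{\sqrt{A_{11}A_{44}}}$$

The signs of these values, $r_{12.34}$, $r_{13.24}$, etc. are the same as $b_{12}$, $b_{13}$, etc.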

The  following  steps  are  recommended  in  the  computation  of  the 
constants,  assuming  that  the  arithmetic  means,  the  standard  devia- 
tions, and  the  simple  correlation  coefficients  have  been  computed. 

(1) Write down D.

(2) Compute $D_{11}$, $D_{22}$, $D_{33}$, $D_{44}$.

(3) Compute $D_{12}$, $D_{13}$, $D_{14}$, $D_{23}$, $D_{24}$, $D_{34}$.

(4) Compute $A_{11}$, $A_{12}$, etc. from $A_{hk} = (-1)^{h+k}D_{hk}$.

(5) Compute the value of D from the formula

$$D = r_{11}A_{11} + r_{12}A_{12} + r_{13}A_{13} + r_{14}A_{14}$$

(6) Compute $b_{12}$, $b_{13}$, etc.

(7) Compute $r_{12.34}$, $r_{13.24}$, $r_{14.23}$.

(8) Compute $R_{1(234)}$ from equation (36).

(9) Compute $S_{1(234)}$ from equation (34).

(10) Write down the regression equation.
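The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a prescribed routine: the function names are our own, the determinant is found by expansion along the first row, and the correlation matrix and standard deviations are made-up values, not data from the text.

```python
import math


def strike(m, h, k):
    """Elements left after striking out row h and column k (0-based)."""
    return [row[:k] + row[k + 1:] for i, row in enumerate(m) if i != h]


def det(m):
    """Determinant by expansion according to the elements of the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** k * m[0][k] * det(strike(m, 0, k)) for k in range(len(m)))


def analyze(r, sigma):
    """Steps (2)-(9) for a correlation matrix r and standard deviations sigma."""
    n = len(r)
    # Steps (2)-(4): co-factors A_hk = (-1)^(h+k) D_hk. Indices are 0-based
    # here, which leaves the parity of h + k unchanged.
    A = [[(-1) ** (h + k) * det(strike(r, h, k)) for k in range(n)]
         for h in range(n)]
    # Step (5): expand D by the first row.
    D = sum(r[0][k] * A[0][k] for k in range(n))
    # Step (6): regression coefficients b_12, b_13, ...
    b = [-(sigma[0] / sigma[k]) * A[0][k] / A[0][0] for k in range(1, n)]
    # Step (7): partial correlations, each carrying the sign of its b.
    partial = [-A[0][k] / math.sqrt(A[0][0] * A[k][k]) for k in range(1, n)]
    # Steps (8) and (9).
    R = math.sqrt(1 - D / A[0][0])
    S = sigma[0] * math.sqrt(D / A[0][0])
    return {"D": D, "b": b, "partial": partial, "R": R, "S": S}


# Illustrative inputs (assumed values).
r = [[1.0, 0.5, 0.4, 0.3],
     [0.5, 1.0, 0.2, 0.1],
     [0.4, 0.2, 1.0, 0.3],
     [0.3, 0.1, 0.3, 1.0]]
sigma = [2.0, 4.0, 5.0, 3.0]
result = analyze(r, sigma)
print(result)
```

Step (10) then follows by substituting the computed coefficients into (26) or (23). Expansion by co-factors is adequate for four variables; for many variables a library routine for linear systems would be the more practical choice.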

EXERCISES 


1. The following table gives the fundamental constants obtained from
the measurement of 450 eggs.¹

                       Length    Breadth    Bulk      Weight
                       (mm.)     (mm.)      (cc.)     (gm.)
Correlations
  Length               1.0000    0.0837     0.5751    0.5797
  Breadth              0.0837    1.0000     0.8602    0.8357
  Bulk                 0.5751    0.8602     1.0000    0.9804
  Weight               0.5797    0.8357     0.9804    1.0000
Arithmetic Means      56.3222   41.9167    51.8400   55.2400
Standard Deviations    2.3862    1.3777     4.2438    4.5923

(a)  Find  the  regression  of  weight  upon  length  and  breadth. 

(b)  What  is  the  estimated  weight  of  an  egg  of  the  following  measure- 
ments: length  56.03  mm.,  breadth  42.02  mm.? 

(c)  Find  the  regression  equation  of  weight  on  length  and  bulk. 

(d)  Find  the  regression  equation  of  weight  on  bulk  and  breadth. 

(e)  Find  the  standard  errors  of  estimate  for  (a),  (c),  and  (d). 

(f)  What  is  the  best  combination  for  estimating  weight? 

2. The data of the following table were secured from measurements of
450 freshmen at Syracuse University ²:

$X_1$ = Academic success as measured by the number of honor points
earned by the student during the first semester in college.

$X_2$ = General intelligence based upon standardized tests.

$X_3$ = Industry and application as measured by the number of hours
per week spent in study.

$X_4$ = Quality of preparatory work based upon average high school grade.

(a) Find the regression equation of $X_1$ on $X_2$, $X_3$, and $X_4$.

(b) Estimate $X_1$ when $X_2$ = 108, $X_3$ = 32, and $X_4$ = 82.

(c) Find $R_{1(234)}$ and $S_{1(234)}$.

(d) Find $r_{12.34}$, $r_{13.24}$, and $r_{14.23}$.

         X1       X2       X3       X4
X1      1.00     0.60     0.32     0.40
X2      0.60     1.00    -0.35     0.36
X3      0.32    -0.35     1.00     0.11
X4      0.40     0.36     0.11     1.00
M's    18.5    100.6     24       79
σ's    11.2     15.8      6        7.5

¹ Pearl and Surface: A Biometrical Study of Egg Production in the Domestic
Fowl, Part III.

² May, Mark: Predicting Academic Success. Journal of Educational
Psychology, Volume XIV, pp. 429-440.

3. Show that

$$r_{12.34} = \frac{r_{12.3} - r_{14.3}\,r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}}$$


4.  By  permuting  the  subscripts  in  number  3  preceding,  write  down  the 
values  of  ri3.24  and  ri4.23. 

5. In the following table the values are monthly averages.

$X_1$ = Wholesale price of butter, 92 score, in ¢ per lb.

$X_2$ = Apparent consumption, millions of pounds.

$X_3$ = Factory production, millions of pounds.

$X_4$ = Stocks in cold storage at end of month, millions of pounds.

Factors Affecting Wholesale Price of Creamery Butter

Year    X1    X2    X3    X4      Year    X1    X2    X3    X4
1919    61    68    72    67      1929    45   130   133    82
1920    61    73    72    60      1930    37   134   133    83
1921    43    90    88    53      1931    28   142   139    55
1922    41    98    96    51      1932    21   142   141    50
1923    47   106   103    47      1933    22   139   147    92
1924    43   111   113    74      1934    26   147   141    69
1925    45   115   114    62      1935    30   138   136    71
1926    44   123   121    68      1936    33   135   136    60
1927    47   124   125    71      1937    34   138   135    64
1928    47   124   124    62      1938    28   142   149   111

(1) Find the values:

M1 =       M2 =       M3 =       M4 =
σ1 =       σ2 =       σ3 =       σ4 =
r12 =      r13 =      r14 =
r23 =      r24 =      r34 =

(2) Find the equation of the regression hyperplane with $X_1$ dependent
upon $X_2$, $X_3$, $X_4$.

(3) Find $R_{1(234)}$ and $S_{1(234)}$.

78.   THE  CASE  OF  n  VARIABLES 

We shall give a summary of the results of multiple regression for
the case of n variables leaving all of the details to be carried out by
the student.

Let

$$X_1 = b_{12}X_2 + b_{13}X_3 + \cdots + b_{1n}X_n + c \tag{37}$$

be the equation which gives the best value to $X_1$ for any set of values
of $X_2$, $X_3$, ..., $X_n$.

If the regression coefficients $b_{12}$, $b_{13}$, etc., and c are determined so
that the sum of the squares of the $X_1$-residuals is a minimum, the
point $(M_1, M_2, \ldots, M_n)$ is on the hyperplane (37). Transferring
our data to this centroidal point as origin, the regression equation
becomes

$$x_1 = b_{12}x_2 + b_{13}x_3 + b_{14}x_4 + \cdots + b_{1n}x_n \tag{38}$$

where

$$x_i = X_i - M_i, \quad i = 1, 2, 3, \ldots, n$$

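Based upon the principle of least squares, the normal equations
for the determination of the regression coefficients are

$$\begin{aligned} b_{12}\sigma_2 + b_{13}\sigma_3r_{23} + \cdots + b_{1n}\sigma_nr_{2n} &= \sigma_1r_{12} \\ b_{12}\sigma_2r_{23} + b_{13}\sigma_3 + \cdots + b_{1n}\sigma_nr_{3n} &= \sigma_1r_{13} \\ &\;\;\vdots \\ b_{12}\sigma_2r_{2n} + b_{13}\sigma_3r_{3n} + \cdots + b_{1n}\sigma_n &= \sigma_1r_{1n} \end{aligned} \tag{39}$$

Defining the major determinant D by the equation

$$D = \begin{vmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1n} \\ r_{21} & r_{22} & r_{23} & \cdots & r_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{n1} & r_{n2} & r_{n3} & \cdots & r_{nn} \end{vmatrix} = \begin{vmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1n} \\ r_{21} & 1 & r_{23} & \cdots & r_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{n1} & r_{n2} & r_{n3} & \cdots & 1 \end{vmatrix} \tag{40}$$

we find

$$\begin{aligned} b_{12} &= \frac{\sigma_1D_{12}}{\sigma_2D_{11}} = -\frac{\sigma_1A_{12}}{\sigma_2A_{11}} \\ b_{13} &= -\frac{\sigma_1D_{13}}{\sigma_3D_{11}} = -\frac{\sigma_1A_{13}}{\sigma_3A_{11}} \\ &\;\;\vdots \\ b_{1n} &= (-1)^n\frac{\sigma_1D_{1n}}{\sigma_nD_{11}} = -\frac{\sigma_1A_{1n}}{\sigma_nA_{11}} \end{aligned} \tag{41}$$

where $D_{hk}$ is the minor and $A_{hk}$ is the co-factor of $r_{hk}$,

$$A_{hk} = (-1)^{h+k}D_{hk}$$

The regression equation for determining the best value of $x_1$ for given
values of $x_2$, $x_3$, ..., $x_n$, is

$$\frac{x_1}{\sigma_1}A_{11} + \frac{x_2}{\sigma_2}A_{12} + \cdots + \frac{x_n}{\sigma_n}A_{1n} = 0 \tag{42}$$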


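In terms of the original variates the equation of regression is

$$\frac{X_1 - M_1}{\sigma_1}A_{11} + \frac{X_2 - M_2}{\sigma_2}A_{12} + \cdots + \frac{X_n - M_n}{\sigma_n}A_{1n} = 0 \tag{43}$$

Equations (42) and (43) may be written

$$\sum_{i=1}^{n} t_iA_{1i} = 0 \tag{44}$$

where

$$t_i = \frac{x_i}{\sigma_i} = \frac{X_i - M_i}{\sigma_i}, \quad i = 1, 2, \ldots, n$$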


Adopting as a measure of the goodness of fit of (42) to the given
data the quantity

$$S_{1(23\ldots n)}^2 = \frac{\Sigma\rho^2}{N}$$

where $\rho = x_1 - (b_{12}x_2 + b_{13}x_3 + \cdots + b_{1n}x_n)$, and the values of $b_{1i}$,
i = 2, 3, ..., n, are given by (41), we find

$$S_{1(23\ldots n)} = \sigma_1\sqrt{\frac{D}{D_{11}}} = \sigma_1\sqrt{1 - R_{1(23\ldots n)}^2} \tag{45}$$


where

$$R_{1(23\ldots n)} = \sqrt{1 - \frac{D}{D_{11}}} \tag{46}$$

Defining the partial coefficient of correlation $r_{1k.23\ldots n}$ by the equation

$$r_{1k.23\ldots n} = \pm\sqrt{b_{1k.23\ldots n}\; b_{k1.23\ldots n}}$$

we find

$$r_{1k.23\ldots n} = \pm\frac{D_{1k}}{\sqrt{D_{11}D_{kk}}} = \pm\frac{A_{1k}}{\sqrt{A_{11}A_{kk}}}$$

The sign of $r_{hk.ab\ldots n}$ is the same as that of $b_{hk}$.

NONLINEAR TRENDS: CURVE-FITTING

79.  INTRODUCTION 

The  investigator  in  any  branch  of  science  is  frequently  confronted 
with  quantitative  data  which,  when  plotted,  seem  to  lie  near  a  smooth 
curve  and  hence  to  obey,  approximately  at  least,  some  mathematical 
law.  Thus  the  following  table  gives  the  area  Y  (in  square  centi- 
meters) of  a  wound  at  the  end  of  X  days. 


[Figure 41. The wound data plotted; horizontal axis: days, 0 to 28.]


The fact that these data when plotted [Figure 41] lie very near a
smooth curve leads us to suspect that they can be represented,
approximately, by the equation of a curve. Such an equation, whose
form is inferred from the results of experiment or observation and
whose constants are determined from experimental or observational
data, is known as an empirical equation. The empirical equation,
once it is derived, is a summarizing expression for the observed data,
and it may be used to obtain a good approximation to the value of
the true ordinate for a given abscissa within the range of values used
in its determination.

The  problem  of  determining  the  type  of  equation  to  be  used  is 
an  indeterminate  one,  for  a  number  of  curves  can  be  drawn  to  pass 
very  near  the  plotted  points  and  hence  a  number  of  equations  can 
be  found  to  represent  the  data  approximately.  The  choice  of  the 
proper  mathematical  function  depends  a  great  deal  upon  the  investi- 
gator's knowledge  of  the  properties  of  curves  and  his  experience 
in  curve-fitting.  Fortunately,  there  are  a  number  of  simple  tests 
that  may  be  employed  to  enable  us  to  make  an  intelligent  choice 
of  the  type  of  equation  to  be  used.  Of  course  one  can  select  an 
equation  in  which  the  number  of  undetermined  constants  equals  the 
number  of  the  observations  and  thus  have  the  resulting  curve  pass 
through  the  observed  points  exactly,  but  this  process  emphasizes 
the  minor  fluctuations  that  represent  simply  errors  of  observation 
and renders impossible the discovery of a simple law. A better
procedure is to select a simple type of function involving only a
few constants and thus allow for fluctuations due to sampling.

Having chosen a particular type of function with which to graduate
the data, our specific question is: How can the constants of the equation
be determined in order to obtain the curve of that type of best fit? The
method employed depends upon the desired degree of accuracy.
We may employ one or more of four methods: (1) the method of
selected points, (2) the method of averages, (3) the method of least squares,
or (4) the method of moments. Of these methods the first is the
simplest; the second requires more computation than the first but
usually gives better results; the third requires considerable
computation but gives the best results and a unique answer to our question;
the fourth gives a unique answer that is identical to that obtained
by the third for polynomial functions.

80.  THE  PROCESS  OF  DIFFERENCING 

In  the  preceding  section  we  alluded  to  certain  simple  tests  that 
may  be  employed  to  assist  us  in  choosing  the  appropriate  type  of 
equation  to  represent  our  data.  Inasmuch  as  these  tests  will  fre- 
quently be  stated  in  the  language  of  differences,  it  may  be  well  that 
we digress at this point from our general problem to learn the
rudiments of this language. Consider the following table:

Table 66

(1)    (2)    (3)    (4)    (5)    (6)
 X      Y     ΔY     Δ²Y    Δ³Y    Δ⁴Y
 0      1      3      3      1      0
 1      4      6      4      1      0
 2     10     10      5      1      0
 3     20     15      6      1
 4     35     21      7
 5     56     28
 6     84

Corresponding values of X and Y, where Y is some undetermined
function of X, are given in columns (1) and (2). In column (3),
headed ΔY, we have the first differences of Y. Any value of ΔY
is found by subtracting a value of Y from its successor. Thus,
3 = 4 - 1, 6 = 10 - 4, etc. Similarly column (4), headed Δ²Y,
is obtained by subtracting each ΔY from its successor. These
values are called the second differences of Y. Other differences are
found in a similar manner. In the table we are considering it may
be noted that the values of Δ²Y are in arithmetic progression, those
of Δ³Y are constant and hence all higher differences are zero.
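The construction of such a table is mechanical, and a short sketch may make the process concrete. The Python below builds every order of difference for the Y values of Table 66; the function name `difference_table` is our own. It also extends the table by one row on the assumption that the highest differences remain zero.

```python
def difference_table(y):
    """Return [Y, first differences, second differences, ...]."""
    table = [list(y)]
    while len(table[-1]) > 1:
        previous = table[-1]
        # Each difference is a value subtracted from its successor.
        table.append([b - a for a, b in zip(previous, previous[1:])])
    return table


y = [1, 4, 10, 20, 35, 56, 84]          # column (2) of Table 66
table = difference_table(y)
for order, column in enumerate(table[:5]):
    print(order, column)

# Working back from the right: the next Y is the sum of the last entry
# of every column, provided the highest differences stay constant.
next_y = sum(column[-1] for column in table)
print(next_y)
```

The third differences come out as 1, 1, 1, 1, and the extended value for X = 7 is 120.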


EXERCISE 

Begin at the right-hand side of Table 66, work back to the left and show
that when X = 7, Y = 120.

The values of X may differ by amounts other than unity. In
general we may indicate by ΔX the difference in X. When the
difference in successive X's is the same, that is, when ΔX is
constant, and Y is a function of X:

$$\Delta Y_X = Y_{X+\Delta X} - Y_X \tag{1}$$

In the following table, where again Y is an undetermined function
of X, we have, for example, ΔX = 2.

Table 67

 X      Y     ΔY     Δ²Y    Δ³Y
 0      0      2      4      0
 2      2      6      4      0
 4      8     10      4
 6     18     14
 8     32

In  experimental  data  involving  two  variables,  the  independent 
variable  is  usually  subject  to  the  control  of  the  experimenter,  and 
the  values  of  the  independent  variable  are  frequently  given  in 
arithmetic  progression.  That  is,  if  X  is  the  independent  variable, 
AX  is  frequently  constant.  We  shall  see  that  this  precaution  on 
the  part  of  the  experimenter  may  greatly  simplify  the  discovery  of 
an  appropriate  equation. 

Consider the straight line:

$$Y = mX + b$$

We have from (1):

$$\begin{aligned} \Delta Y &= m(X + \Delta X) + b - (mX + b) \\ \Delta Y &= m \cdot \Delta X \end{aligned}$$

From this result it is seen that if ΔX is constant, ΔY is also
constant; further, ΔY/ΔX is constant (compare Section 57, p. 204).
Consider now the parabola:

$$Y = aX^2 + bX + c$$

Applying (1):

$$\begin{aligned} \Delta Y &= a(X + \Delta X)^2 + b(X + \Delta X) + c - (aX^2 + bX + c) \\ \Delta Y &= 2aX\Delta X + b\Delta X + a(\Delta X)^2 \\ \Delta(\Delta Y) = \Delta^2Y &= 2a(X + \Delta X)\Delta X + b\Delta X + a(\Delta X)^2 - 2aX\Delta X - b\Delta X - a(\Delta X)^2 \\ \Delta^2Y &= 2a(\Delta X)^2 \end{aligned}$$

From this result we see that if ΔX is constant, the second difference
of the polynomial $aX^2 + bX + c$ is also constant.

One may continue this process and show that if ΔX is constant,
the nth difference of a polynomial of the nth degree is also constant.

The converse of this theorem is also true, namely:

If for a constant ΔX, ΔⁿY is also constant, then Y is a polynomial
in X of degree n.*

The nth differences of the values of Y obtained from observational
data are seldom constant. If, however, the nth differences of Y are
approximately constant, ΔX being constant, we can represent the
data approximately by:

$$Y = a_0 + a_1X + a_2X^2 + \cdots + a_nX^n$$
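The test suggested by this theorem can be sketched as a short routine that differences the Y values until they become constant. The function name and the tolerance parameter are assumptions for illustration; the Y values are those of Table 67 as we read it, with ΔX = 2.

```python
def constant_difference_order(y, tol=1e-9):
    """Smallest n for which the n-th differences of y are constant.

    Returns (n, constant value), or None when no order with at least
    two entries is constant within the tolerance tol.
    """
    diffs = list(y)
    n = 0
    while len(diffs) > 1:
        if max(diffs) - min(diffs) <= tol:
            return n, diffs[0]
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
        n += 1
    return None


y = [0, 2, 8, 18, 32]      # Y values of Table 67, X = 0, 2, 4, 6, 8
dx = 2
order, constant = constant_difference_order(y)

# For a parabola the second difference equals 2a(dX)^2, so a follows at once.
a = constant / (2 * dx**2)
print(order, constant, a)
```

Here the second differences are constant and equal to 4, giving a = 1/2. With observational data the tolerance would have to be loosened, since the differences are then only approximately constant.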


EXERCISES 

1. If Y = c, show that ΔY = 0.

2. If $Y = X^3$, show that $\Delta^3Y = 6(\Delta X)^3$.

3. Prepare a table for the function $Y = 2X^2 - 3X + 4$ for X = 0, 1,
2, 3, 4 and find the second differences from the table.

4.  Prepare  a  table  for  the  function  F  =  -  +  8Z  +  2  for 
Z  =  1,  3,  5,  7,  9  and  find  the  third  differences  from  the  table. 

5. In the following table, ΔX is constant (= 1) and Δ²Y is constant
(= 2). Hence Y is a quadratic function of X:

$$Y = aX^2 + bX + c$$

Find the values of a, b, and c. Find Y when X = 5.
Hint:

$$2 = 2a(\Delta X)^2 = 2a(1)^2 = 2a$$

 X      Y     ΔY     Δ²Y
 0      2     -1      2
 1      1      1      2
 2      2      3      2
 3      5      5
 4     10

* For a proof, see T. R. Running, Empirical Formulas, p. 18.

6. Complete the accompanying table. Find the function that
represents the data. Find Y when X = 5.

 X      Y     ΔY
 0      0
 1     -1
 2      4
 3     21
 4     56

7.  Prove  that  if  a  sequence  of  numbers  is  in  geometric  progression,  their 
logarithms  are  in  arithmetic  progression. 

8.  Prove  that  if  a  sequence  of  numbers  is  in  geometric  progression,  their 
first  differences  are  in  geometric  progression. 


81.   FITTING  A  STRAIGHT  LINE  TO  OBSERVED  DATA 

A  large  portion  of  Chapter  7  was  devoted  to  the  problem  of  fitting 
a  straight  line  to  observed  data  by  the  method  of  least  squares. 
We  desired  at  that  time  to  emphasize  the  method  of  least  squares 
because  we  were  then  interested  in  finding  a  unique  line  for  which 
we  could  secure  a  test  for  the  goodness  of  fit  and  thus  arrive  at  the 
Bravais-Pearson  cross-product  coefficient  of  correlation.  Since  one 
may  frequently  not  desire  so  accurate  a  solution  as  is  given  by  the 
method  of  least  squares  —  especially  at  the  price  of  tedious  computa- 
tion one  must  pay  to  secure  it  —  we  shall  discuss  two  other  less  ac- 
curate methods. 

A.  The  Method  of  Selected  Points.  To  apply  this  method  we 
must  plot  the  observed  data  carefully.  We  then  draw  a  straight 
line  among  the  points  which  will  pass  as  near  as  possible  to  each  of 
them. Since the straight-line equation

$$Y = mX + b$$

has two undetermined constants, m and b, we must obtain two
equations with m and b as unknowns from which to determine them. If
the line happens to pass through two of the plotted points or through
any other two points whose coordinates can be determined
approximately, we can substitute their coordinates in the given equation and
solve the two resulting equations for m and b. In any case the points
so used should be as far apart as possible.

Consider again the temperature-resistance data of Table 42,
to which we have previously given attention in Section 59 (p. 211).

Table 68

  t        R
 10.5    10.42
 29.5    10.94
 42.7    11.32
 60.0    11.80
 75.5    12.24
 91.1    12.67

[Figure 42. The data of Table 68 plotted; horizontal axis marked from 10 to 100.]


These data, when plotted, present six points that may seem to lie
upon a straight line. Let us seek further evidence by applying the
test for straight-line data. We have learned in the preceding section
that if ΔY/ΔX is constant the data can be fitted to a straight-line
equation. In Table 69 we have computed the several values of ΔR/Δt.

Table 69

  t        R       Δt      ΔR     ΔR/Δt
 10.5    10.42    19.0    0.52    0.0274
 29.5    10.94    13.2    0.38    0.0289
 42.7    11.32    17.3    0.48    0.0277
 60.0    11.80    15.5    0.44    0.0284
 75.5    12.24    15.6    0.43    0.0276
 91.1    12.67
                          Mean  = 0.0280

Since they are approximately constant we are justified in
concluding that the data may be fitted approximately to a straight-line
equation:

$$R = mt + b$$

The line we have drawn does not pass through any of the given
points. However, it seems to pass through the points A(20, 10.7)
and B(90, 12.6) whose ordinates we have estimated from the graph.
Substituting in the given equation the coordinates of the points
we have:

$$\begin{aligned} b + 20m &= 10.7 \\ b + 90m &= 12.6 \end{aligned}$$

from which we obtain

$$m = 0.027 \qquad b = 10.157$$

Hence the required relation is:

$$R = 0.027t + 10.157$$

The least-square solution (Exercise 1 on p. 220) gives:

$$R = 0.02799t + 10.122$$
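The method of selected points amounts to solving two linear equations in m and b. A minimal sketch, using the two points A and B read from the graph (the function name is our own):

```python
def line_through(p, q):
    """Slope m and intercept b of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)
    b = y1 - m * x1
    return m, b


# Points A and B estimated from the graph in the text.
m, b = line_through((20, 10.7), (90, 12.6))
print(round(m, 3), round(b, 3))
```

The result depends entirely on which two points are chosen, which is why the method gives no unique answer.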


EXERCISE 

Assume  the  line  passes  through  the  first  and  last  points,  (10.5,  10.42) 
and  (91.1,  12.67),  and  find  its  equation. 

It will be noted that the arithmetic mean of the values of ΔR/Δt
in Table 69 is 0.0280. How may this be used in finding an equation
for a line fitting our data approximately?

If we take this average slope as the slope of our required line we
have

$$R = 0.0280t + b$$

We can now substitute the coordinates of each of the six given
points and thus determine six values of b. Their mean may be taken
as the value of b for the required line. We shall leave the
computation as an exercise for the student. He should receive for an answer:

$$R = 0.0280t + 10.1216$$
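For readers who wish to check their work on this computation, the following sketch carries it out on the data of Table 68. The rounding of the mean slope to four decimals follows the text; the variable names are our own.

```python
t = [10.5, 29.5, 42.7, 60.0, 75.5, 91.1]
R = [10.42, 10.94, 11.32, 11.80, 12.24, 12.67]

# Successive values of dR/dt, as in Table 69.
slopes = [(R[i + 1] - R[i]) / (t[i + 1] - t[i]) for i in range(len(t) - 1)]

# Average slope, rounded to four decimals as in the text.
m = round(sum(slopes) / len(slopes), 4)

# One value of b from each of the six points, then their mean.
intercepts = [Ri - m * ti for ti, Ri in zip(t, R)]
b = sum(intercepts) / len(intercepts)
print(m, round(b, 4))
```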

B. The Method of Averages. The fundamental principle of the
method of averages is that an empirical curve of given type best
fitting a given group of points is one for which the algebraic sum of
the residuals is zero. (It will be recalled that this criterion was
satisfied by the line determined by the method of least squares.¹) From
Section 59 (p. 210), if $\rho_i$ is any residual:

$$\rho_i = Y_i - mX_i - b$$

and

$$\Sigma\rho_i = \Sigma Y_i - m\Sigma X_i - nb$$

¹ See Exercise 6 on p. 221.

Since the sum of the residuals is zero we have:

$$m\Sigma X + nb = \Sigma Y$$

In order to obtain two equations which may be solved for the
unknowns, m and b, we divide our data, Table 68 (p. 312), into two
groups each containing three sets of data. For the first group we
choose the first three sets of data for which Σt = 82.7, n = 3, ΣR
= 32.68, and for the second group the remaining three sets of data
for which Σt = 226.6, n = 3, ΣR = 36.71. We then have the
equations:

$$\begin{aligned} 82.7m + 3b &= 32.68 \\ 226.6m + 3b &= 36.71 \end{aligned}$$

from which we obtain

$$m = 0.0280 \qquad b = 10.121$$

Hence the required relation is:

$$R = 0.0280t + 10.121$$
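The grouping and solution just described can be sketched as follows. The function name is our own, and splitting the data into a first and a second half is the choice made in the text; other groupings would give somewhat different lines.

```python
def method_of_averages(x, y):
    """Fit y = m*x + b by setting the residual sum to zero in each half."""
    half = len(x) // 2
    sx1, sy1, n1 = sum(x[:half]), sum(y[:half]), half
    sx2, sy2, n2 = sum(x[half:]), sum(y[half:]), len(x) - half
    # Solve  m*sx1 + n1*b = sy1  and  m*sx2 + n2*b = sy2  by determinants.
    d = sx1 * n2 - sx2 * n1
    m = (sy1 * n2 - sy2 * n1) / d
    b = (sx1 * sy2 - sx2 * sy1) / d
    return m, b


t = [10.5, 29.5, 42.7, 60.0, 75.5, 91.1]
R = [10.42, 10.94, 11.32, 11.80, 12.24, 12.67]
m, b = method_of_averages(t, R)
print(round(m, 4), round(b, 3))
```

Since the residuals sum to zero in each group, they also sum to zero over the whole set of points.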

C.  The  Method  of  Least  Squares.  Curve-fitting  by  the  method 
of  least  squares  is  based  upon  the  principle  that  the  empirical  curve 
of  a  given  type  best  fitting  a  given  set  of  points  is  that  one  in  which 
the  constants  are  so  determined  that  they  will  make  the  sum  of  the 
squares  of  the  residuals  a  minimum.  Since  the  squares  of  the  resid- 
uals are  positive  quantities,  the  requirement  that  their  sum  shall  be 
a  minimum  gives  assurance  that  the  numerical  values  of  the  residuals 
will  be  such  that  the  best-fitting  curve  will  pass  as  close  as  possible 
to  all  the  points. 

Inasmuch as Section 59 (p. 210) was devoted to the problem of
fitting the line

\[ Y = mX + b \]

to a set of points by the method of least squares, we shall merely
recapitulate here the findings of that section. By minimizing

\[ \sum \rho_i^2 = \sum (Y_i - mX_i - b)^2 \]

where \(\rho_i\) is the Y-residual of the ith point, we obtain the normal
equations

\[ m\sum X + nb = \sum Y \]
\[ m\sum X^2 + b\sum X = \sum XY \]

which, when solved, gave:

\[ m = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} \qquad b = \frac{\sum X^2 \sum Y - \sum X \sum XY}{n\sum X^2 - (\sum X)^2} \]
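The solved normal equations translate directly into a few lines of code. The sketch below is an illustration only: the function name `least_squares_line` and the five data points are hypothetical and do not come from the text. It also confirms the property noted under the method of averages, that the residuals of the least-squares line sum to zero.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least-squares line Y = m*X + b."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    den = n * sxx - sx ** 2
    if den == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    m = (n * sxy - sx * sy) / den
    b = (sxx * sy - sx * sxy) / den
    return m, b


# Hypothetical data, chosen only to exercise the formulas.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
m, b = least_squares_line(xs, ys)
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
print(m, b, sum(residuals))
```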

EXERCISES 

1.  Show  that  the  point  (Mx,  My)  is  on  a  line  determined  by  the  method 
of  averages. 

2.  The  following  table  gives  the  population  of  France  at  each  census 
from  1806  to  1866.  Determine  by  the  method  of  averages  a  straight  line 
well  adapted  to  the  data,  choosing  X  =  0  at  1836. 


Population of France, 1806-1866

Year    Population      Year    Population
        (millions)              (millions)

1806      29.11         1851      35.78
1821      30.46         1856      36.04
1831      32.57         1861      37.39
1836      33.54         1866      38.07
1846      35.40

3.  Find  by  the  method  of  least  squares  the  equation  of  the  best-fitting 
straight  line  to  the  data  of  the  following  table.  What  are  the  predicted 
net  earnings,  based  upon  this  line,  for  the  year  1929?  The  actual  net 
earnings  were  48.5  millions. 


Annual Earnings of the Associated Gas and
Electric System, 1920-1928 ¹

Year    Net Earnings             Year    Net Earnings
        (millions of dollars)            (millions of dollars)

1920        13.4                 1925        29.5
1921        16.2                 1926        33.5
1922        19.2                 1927        37.8
1923        22.7                 1928        40.6
1924        25.1                 1929        . . . .

¹ The data are from Time, Jan. 27, 1930.
82. THE EXPONENTIAL FUNCTION \(Y = ab^X\)

Excepting the linear function, probably no expression with two
undetermined constants is more useful in characterizing observed
data than the exponential function \(Y = ab^X\). It may be described
as that function whose rate of change is proportional to the value
of the function. The rate of change may be positive or negative,
that is, Y may increase with X or Y may decrease as X increases.
Because the accumulated amount of a sum of money placed at compound
interest at a given rate for a given time is expressed by this
function, it is known as the compound interest law. Thus, if $100 is
placed at compound interest for X years at 5 per cent the accumulated
amount Y is given by:

\[ Y = 100(1.05)^X \]

We represent this function graphically.


Figure  43  Table  70 


The exponential function is also called the law of organic growth
because many biological phenomena obey closely this law of growth.
For example, a culture of bacteria, or populations of mice, of
rabbits, of human beings, when placed in environments conducive
to growth, will increase for a time in approximate accordance with
this law.

The exponential law is applicable to many other types of data.
Many data from the commercial and the economic fields show
exponential trends. We find the law especially applicable to data on
production and to data on the periodic earnings of many industrial
organizations.

A  simple  test  for  the  exponential  function  is  contained  in  the 
following: 

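Theorem I. If the values of X are in arithmetic progression and
the corresponding values of Y are in geometric progression, the relation
between the variables is expressed by the equation:

\[ Y = ab^X \]

Table 71

X                             Y

X_1                           Y_1
X_2 = X_1 + ΔX                Y_2 = rY_1
X_3 = X_1 + 2ΔX               Y_3 = r^2 Y_1
. . .                         . . .
X_n = X_1 + (n - 1)ΔX         Y_n = r^(n-1) Y_1

From the hypothesis we have the data as shown in the accompanying
table. Since

\[ X_n = X_1 + (n - 1)\Delta X \]

we have:

\[ n - 1 = \frac{X_n - X_1}{\Delta X} \]

and hence

\[ Y_n = Y_1 r^{n-1} = Y_1 r^{(X_n - X_1)/\Delta X} = Y_1 r^{-X_1/\Delta X} \left(r^{1/\Delta X}\right)^{X_n} \]

or

\[ Y_n = ab^{X_n} \]

when

\[ a = Y_1 r^{-X_1/\Delta X} \quad \text{and} \quad b = r^{1/\Delta X} \]

That is, any X is connected with the corresponding Y by the
relation:

\[ Y = ab^X \qquad (2) \]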


318  NONLINEAR  TRENDS:  CURVE-FITTING 

Illustrative  Problem  1.  Consider  Table  72,  which  gives  the  popu- 
lation of  the  United  States  at  each  ten-year  census  from  1800  to  1890. 

Table 72. Population of the United States, 1800-1890

 t     Year     Population     Ratio of Each Population
        X       (millions)     to the One Above
                    P

 0     1800        5.3
 1     1810        7.2               1.36
 2     1820        9.6               1.33
 3     1830       12.9               1.34
 4     1840       17.1               1.33
 5     1850       23.2               1.36
 6     1860       31.4               1.35
 7     1870       38.6               1.23
 8     1880       50.2               1.30
 9     1890       63.0               1.25

                              Mean   1.3167

Let \(t = (X - 1800)/10\).

Here we note that the values of X (and t) are in arithmetical
progression and that the values of P are approximately in a geometric
progression since the ratio of any population to the one preceding
is approximately constant. Hence we may assume that the data
follow approximately the exponential law: \(P_t = ab^t\).

If we assume that the point t = 0, P = 5.3 is on the curve, and
that the decade rate of increase is the arithmetic mean of the ratios,
we can immediately obtain a first approximation formula:

\[ P_0 = 5.3 = ab^0 = a \]
\[ a = 5.3 \]

Since by definition

\[ P_t = a(1.3167)^t = ab^t \]

we have:

\[ b = 1.3167 \]

Hence

\[ P_t = 5.3(1.3167)^t \]

may be considered a first or crude approximation. By assigning
t = 0, 1, 2, 3, . . . , 9, the computed values of \(P_t\), which can be
compared with the observed values, can be found.
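The crude approximation is easily tabulated. The following Python sketch is an illustration only (the variable names are our own); it rounds each ratio to two decimals, as Table 72 does, takes their mean as b, and lists the observed populations beside the values given by \(P_t = 5.3\,b^t\).

```python
# Populations of Table 72 (millions), for t = 0, 1, ..., 9.
population = [5.3, 7.2, 9.6, 12.9, 17.1, 23.2, 31.4, 38.6, 50.2, 63.0]

# Ratio of each population to the one above, rounded as in the table.
ratios = [round(later / earlier, 2)
          for earlier, later in zip(population, population[1:])]
mean_ratio = sum(ratios) / len(ratios)

a = population[0]            # the curve is assumed to pass through t = 0
b = round(mean_ratio, 4)     # decade rate of increase
approx = [a * b ** t for t in range(len(population))]

for t, (observed, computed) in enumerate(zip(population, approx)):
    print(t, observed, round(computed, 1))
```

Because the curve is forced through the first point and the later ratios fall below the mean, the computed values drift above the observed ones toward the end of the range.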
For a closer approximation we proceed as follows. From

\[ P = ab^t \]

we have:

\[ \log P = (\log b)t + \log a \]

Now let \(Y = \log P\), \(m = \log b\), \(k = \log a\).

We then have:

\[ Y = mt + k \]

which is a straight line. Therefore we can fit the curve \(P = ab^t\) to
the given data by fitting the line \(Y = mt + k\) to the corresponding
(t, Y = log P) data. We shall do this by the method of least squares.

From Sections 59 (p. 218) and 81 (p. 315) we have:

\[ m = \frac{n\sum tY - \sum t \sum Y}{n\sum t^2 - (\sum t)^2} = \frac{n\sum t\log P - \sum t \sum \log P}{n\sum t^2 - (\sum t)^2} \]

\[ k = \frac{\sum t^2 \sum Y - \sum t \sum tY}{n\sum t^2 - (\sum t)^2} = \frac{\sum t^2 \sum \log P - \sum t \sum t\log P}{n\sum t^2 - (\sum t)^2} \]


We  shall  use  the  following  form  with  eight-place  logarithms  to 
assist  in  finding  m  and  k. 

Table 73

  t       P        log P       t^2      t log P      Computed P

  0      5.3     0.7242759      0      0.0000000        5.5
  1      7.2     0.8573325      1      0.8573325        7.3
  2      9.6     0.9822712      4      1.9645424        9.6
  3     12.9     1.1105897      9      3.3317691       12.7
  4     17.1     1.2329961     16      4.9319844       16.8
  5     23.2     1.3654880     25      6.8274400       22.2
  6     31.4     1.4969296     36      8.9815776       29.3
  7     38.6     1.5865873     49     11.1061111       38.6
  8     50.2     1.7007037     64     13.6056296       51.0
  9     63.0     1.7993405     81     16.1940645       67.3

 45             12.8565145    285     67.8004512

\[ m = \log b = \frac{10(67.8004512) - 45(12.8565145)}{10(285) - (45)^2} = 0.1205592 \]

\[ b = 1.319967 \]

\[ k = \log a = \frac{285(12.8565145) - 45(67.8004512)}{10(285) - (45)^2} = 0.7431349 \]

\[ a = 5.535186 \]

Hence our law is:

\[ P = 5.535186(1.319967)^t \]

or

\[ \log P = 0.1205592t + 0.7431349 \]

By assigning t = 0, 1, 2, . . . , 9 we obtain the computed values of
P which are found in Table 73.

We can use this formula to predict the populations in 1900, 1910,
1920 by assigning t = 10, 11, 12. We find the predicted populations
to be 88.9, 117.3, and 154.8, whereas the actual populations were 76.0,
92.0, and 105.7. This shows that an empirical formula must be
used with caution for values outside the given abscissal range. In
this particular case the exponential law ceased to operate after 1870
and we began then to approach the point of saturation.

We  shall  leave  it  as  an  exercise  for  the  student  to  find  the  law  for 
the  population  based  upon  the  method  of  averages. 

It  frequently  happens  that  the  data  are  not  given  with  the  values 
of  the  independent  variable  in  arithmetic  progression  and  hence 
the  test  of  Theorem  I  will  not  apply.  In  such  cases  we  can  use  the 
following: 

Theorem II. If the variables X and Y are so related that \(\Delta \log Y/\Delta X\)
is constant, then the relation between them can be expressed by the
formula:

\[ Y = ab^X \]

Since by hypothesis

\[ \frac{\Delta \log Y}{\Delta X} = m, \text{ a constant,} \]

we have by Section 80 (p. 310):

\[ \log Y = mX + k \]

or, if \(k = \log a\),

\[ \log Y = mX + \log a \]
\[ \log Y - \log a = mX \]
\[ \log (Y/a) = mX \]
\[ Y/a = 10^{mX} = (10^m)^X = b^X \]

or

\[ Y = ab^X \]

where \(10^m = b\).

We shall apply this theorem to

Illustrative  Problem  2.  The  following  table  shows  the  amount  A  of 
a  substance  remaining  in  a  reacting  chemical  system  at  the  expiration 
of  a  given  time  t  (Harcourt  and  Esson). 


Table 74

  t       A        log A        Δt      Δ log A       Δ log A / Δt

  2     94.8     1.9768083       3     -0.0328194       -0.0109
  5     87.9     1.9439889       3     -0.0338984       -0.0113
  8     81.3     1.9100905       3     -0.0356087       -0.0119
 11     74.9     1.8744818       3     -0.0375251       -0.0125
 14     68.7     1.8369567       3     -0.0307767       -0.0103
 17     64.0     1.8061800      10     -0.1133331       -0.0113
 27     49.3     1.6928469       4     -0.0493942       -0.0123
 31     44.0     1.6434527       4     -0.0512759       -0.0128
 35     39.1     1.5921768       9     -0.0924897       -0.0103
 44     31.6     1.4996871

The values of \(\Delta \log A/\Delta t\) are fairly constant and we conclude
therefore that the data may be represented approximately by

\[ A = ab^t \]

or

\[ \log A = (\log b)t + \log a \]

or

\[ Y = mt + k \]

when \(Y = \log A\), \(m = \log b\), and \(k = \log a\).
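The test of Theorem II amounts to computing divided differences of log A. A minimal Python sketch follows, using the data of Table 74; the variable names are our own and the logarithms are taken at full machine precision.

```python
import math

# Data of Table 74 (Harcourt and Esson).
t = [2, 5, 8, 11, 14, 17, 27, 31, 35, 44]
A = [94.8, 87.9, 81.3, 74.9, 68.7, 64.0, 49.3, 44.0, 39.1, 31.6]

log_A = [math.log10(value) for value in A]
points = list(zip(t, log_A))

# Delta log A / Delta t between each pair of consecutive observations.
slopes = [(y2 - y1) / (t2 - t1)
          for (t1, y1), (t2, y2) in zip(points, points[1:])]

for (t1, _), slope in zip(points, slopes):
    print(t1, round(slope, 4))
```

How nearly constant the quotients must be before the exponential law is accepted remains a matter of judgment; the sketch only supplies the numbers.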

We  shall  use  the  method  of  averages  to  determine  the  constants. 
Dividing  the  data  into  two  groups,  the  first  five  sets  of  data  for 
the  first  group  and  the  remaining  five  sets  of  data  for  the  second 
group,  we  obtain: 

\[ \sum Y = \sum \log A = 9.5423262, \qquad \sum t = 40, \qquad n = 5 \]
\[ \sum Y = \sum \log A = 8.2343435, \qquad \sum t = 154, \qquad n = 5 \]

Recalling that in the method of averages the sum of the residuals
is zero; that is,

\[ \sum (Y - mt - k) = 0 \]

or

\[ \sum Y = m\sum t + nk \]

we have upon substituting the above values:

\[ 5k + 40m = 9.5423262 \]
\[ 5k + 154m = 8.2343435 \]

Solving, we obtain:

\[ m = \log b = -0.0114735 \]
\[ k = \log a = 2.0002532 \]

\[ b = 0.973927 \]
\[ a = 100.0586 \]

Hence the law is:

\[ A = 100.0586(0.973927)^t \]

or

\[ \log A = -0.0114735t + 2.0002532 \]

By  substituting  the  given  values  of  t  we  obtain  the  computed 
values  of  log  A  from  which  we  obtain  the  computed  values  of  A 
which  are  shown  in  the  following  table. 


  t     Observed     Computed      Computed     Residuals
           A          log A           A

  2      94.8       1.9773062       94.9         -0.1
  5      87.9       1.9428857       87.7          0.2
  8      81.3       1.9084652       81.0          0.3
 11      74.9       1.8740447       74.8          0.1
 14      68.7       1.8396242       69.1         -0.4
 17      64.0       1.8052037       63.9          0.1
 27      49.3       1.6904687       49.0          0.3
 31      44.0       1.6445747       44.1         -0.1
 35      39.1       1.5986807       39.7         -0.6
 44      31.6       1.4954192       31.3          0.3
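The whole computation of Illustrative Problem 2 can be retraced as follows. This Python sketch is an illustration only, with names of our own choosing; it groups the first five and the last five observations, as in the text, solves the two residual equations for m and k, and lists the residuals of the fitted law.

```python
import math

# Data of Table 74.
t = [2, 5, 8, 11, 14, 17, 27, 31, 35, 44]
A = [94.8, 87.9, 81.3, 74.9, 68.7, 64.0, 49.3, 44.0, 39.1, 31.6]
Y = [math.log10(value) for value in A]

# Sums for the two groups: first five and last five sets of data.
st1, sy1, n1 = sum(t[:5]), sum(Y[:5]), 5
st2, sy2, n2 = sum(t[5:]), sum(Y[5:]), 5

# Solve  m*st + n*k = sy  for the two groups by Cramer's rule.
det = st1 * n2 - st2 * n1
m = (sy1 * n2 - sy2 * n1) / det    # m = log b
k = (st1 * sy2 - st2 * sy1) / det  # k = log a
a, b = 10 ** k, 10 ** m

residuals = [observed - a * b ** ti for ti, observed in zip(t, A)]
for ti, r in zip(t, residuals):
    print(ti, round(r, 1))
```

Note that it is the residuals of log A, not those of A itself, whose sum is made zero within each group.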

Exercise.  Solve  this  problem  by  method  of  averages  using  four-place 
logarithms. 




EXERCISES ¹

1.  Show  that  an  exponential  curve  may  give  a  satisfactory  fit  for  the 
data  of  Table  65  (p.  306).  Fit  an  exponential  curve  to  these  data  and  esti- 
mate from  the  equation  the  values  of  Y  when  X  =  32  and  when  X  =  36. 
Compare  these  results  with  the  actual  values:  Y  =  21.3  when  X  =  32, 
and  Y  =  16.8  when  X  =  36. 

Answer: Method of least squares gives \(Y = 108.8035(0.952348)^X\).

2. In the following table p is the barometric pressure in inches of a
column of mercury at distance h in feet above the sea level. Show that
an exponential curve, \(p = ab^h\), may be appropriately applied to these data.
Find the equation of the best-fitting curve and the values of p when
h = 1,000 ft., 2,000 ft., 5,000 ft.

h      0     886    2,753    4,763    6,942    10,593
p     30      29       27       25       23        20

3. The following table exhibits the values of the temperature T reached
by a cooling body at the expiration of various times t. Determine the
best-fitting curve of the type \(T = ab^t\) for the data of this table.

t      0      3.79    11.93    21.23    31.68    44.11    59.12
T    17.9    17.0     15.2     13.4     11.6      9.8      8.0

4.  Fit  an  exponential  curve  to  the  data  of  Exercise  1,  page  91. 

5.  The  following  observations  were  made  on  a  growing  plant.  The 
time  is  reckoned  in  days  from  the  first  observation.  What  is  the  law  of 
growth? 


Days               0      1      2      3      4      5      6      7      8
Height (inches)   0.75   1.20   1.75   2.50   3.45   4.70   6.20   8.25   11.50

83. THE POWER FUNCTION \(Y = aX^b\)

In the preceding sections of this chapter we have dealt with the
problems which involved fitting the linear function \(Y = aX + b\) and
the exponential function \(Y = ab^X\) to observed data. A third function
with two undetermined constants, the power function \(Y = aX^b\),
finds frequent application. Owing to the fact that the constants
can be determined approximately by rectifying the curve, that is,
by transforming it into a straight-line equation (as was done
with the exponential function), the power curve is not difficult to
employ.

¹ We leave it to the discretion of the teacher to suggest the method that is to
be used in solving these exercises.

The  power  function  is  parabolic  in  form  when  b  is  positive  and 
hyperbolic  if  b  is  negative.  The  parabolic  curves  all  pass  through 
the  points  (0,  0)  and  (1,  a)  and  also  enjoy  the  property  that  Y  in- 
creases with  X.  The  hyperbolic  curves  all  pass  through  the  point 
(1,  a),  have  the  coordinate  axes  as  asymptotes,  and  enjoy  the 
property  that  Y  decreases  as  X  increases. 

EXERCISE 

Plot  on  the  same  coordinate  axes  for  X  >  0  the  curves: 

a.  F  =  2Z2  d.    7  =  2X-^ 

b.  r  =  2X  e.    r  = 


c.    F  = 


f.    F  =  2X-<»-s 


A  simple  test  —  not  always  applicable  —  for  determining  if  the 
power  function  is  applicable  is  contained  in: 

Theorem I. If the values of X are in geometrical progression and
the corresponding values of Y are also in geometrical progression, then
the relation between the variables is expressed by the formula:

\[ Y = aX^b \]

Table 75

X                        Y

X_1                      Y_1
X_2 = rX_1               Y_2 = RY_1
X_3 = r^2 X_1            Y_3 = R^2 Y_1
. . .                    . . .
X_n = r^(n-1) X_1        Y_n = R^(n-1) Y_1

From the hypothesis we have the data as in Table 75. Since

\[ X_n = r^{n-1}X_1 \quad \text{and} \quad Y_n = R^{n-1}Y_1, \]

we have, applying logarithms:

\[ n - 1 = \frac{\log X_n - \log X_1}{\log r} \]

and

\[ n - 1 = \frac{\log Y_n - \log Y_1}{\log R} \]

Equating these values of (n - 1) and writing \(\log R/\log r = b\),
we have

\[ \frac{\log Y_n - \log Y_1}{\log X_n - \log X_1} = b, \]

that is:

\[ \log (Y_n/Y_1) = \log (X_n/X_1)^b \]

or

\[ Y_n = (Y_1/X_1^b)X_n^b = aX_n^b \]

where

\[ b = \log R/\log r \quad \text{and} \quad a = Y_1/X_1^b \]

That is, for any set of corresponding values we have:

\[ Y = aX^b \]

There  is  an  evident  practical  difficulty  with  this  beautiful  theorem. 
There  is  rarely  any  reason,  or  even  an  opportunity,  for  the  observer 
to  gather  his  data  with  the  values  of  one  variable  in  geometric  pro- 
gression and  thus  make  possible  a  test  to  determine  if  the  other 
variable  is  also  in  geometric  progression.  In  general  the  observer 
has  no  predilections  as  to  the  law;  he  gathers  the  data  and  may  hope 
to  discover  a  law.  Very  frequently,  however,  the  careful  observer 
will,  if  possible,  secure  data  with  the  independent  variable  ordered 
in  some  definite  manner,  most  frequently  in  arithmetic  progression. 

When  Theorem  I  may  not  be  applicable  we  may  be  able  to  use  the 
following: 

Theorem II. If the values of X and Y are so related that
\(\Delta \log Y/\Delta \log X\) is constant, then the relation between the variables is
expressed by:

\[ Y = aX^b \]

Since

\[ \frac{\Delta \log Y}{\Delta \log X} = b, \text{ a constant,} \]

we have by Section 80 (p. 310):

\[ \log Y = b \log X + c \]
\[ \log Y = \log X^b + \log a \quad (\text{if } c = \log a) \]
\[ \log Y = \log aX^b \]

or

\[ Y = aX^b \]

Consider Table 76, which shows the currents, i, in amperes passing
through a 118-volt tungsten lamp for various terminal voltages, e.

Table 76

  e        i        Ratio of Any i
                    to Preceding

  2      0.0245
  4      0.0370         1.51
  8      0.0570         1.54
 16      0.0855         1.50
 32      0.1295         1.51
 64      0.2000         1.54
128      0.3035         1.53

We note that the independent variable, e, is given in a geometric
progression. We find, as the table shows, that the corresponding
values of i are also essentially in geometric progression. Therefore
the data follow the law:

\[ i = ae^b \qquad (4) \]

Since

\[ \log i = b(\log e) + \log a \qquad (5) \]

if we let \(Y = \log i\), \(X = \log e\), \(k = \log a\)

we have:

\[ Y = bX + k \qquad (6) \]

which is a straight line. Therefore we may approximately fit the
curve (4), \(i = ae^b\), to the given data by fitting the line (6), \(Y = bX + k\),
to the corresponding (X = log e, Y = log i) data.
We shall first use the method of averages.


Table 77

  e        i        log e = X      log i = Y

  2      0.0245     0.3010300      2̄.3891661
  4      0.0370     0.6020600      2̄.5682017
  8      0.0570     0.9030900      2̄.7558749
 16      0.0855     1.2041200      2̄.9319661
 32      0.1295     1.5051500      1̄.1122698
 64      0.2000     1.8061800      1̄.3010300
128      0.3035     2.1072100      1̄.4821587

A bar over the characteristic marks it as negative; thus 2̄.3891661 = 0.3891661 - 2.
Dividing the data up into two groups, the first four sets constituting
the first group and the last three sets the second group, we have:

\[ n = 4, \qquad \sum X = 3.0103000, \qquad \sum Y = \bar{6}.6452088 \]
\[ n = 3, \qquad \sum X = 5.4185400, \qquad \sum Y = \bar{3}.8954585 \]

Substituting these values in the residual equation

\[ b\sum X + nk = \sum Y \]

we have:

\[ 3.01030b + 4k = 4.6452088 - 10 \]
\[ 5.41854b + 3k = 7.8954585 - 10 \]

Solving, we have:

\[ b = 0.6047655 \]
\[ k = \log a = \bar{2}.2061708 \]
\[ a = 0.016076 \]

Hence by the method of averages the required relation is:

\[ i = 0.016076e^{0.6047655} \]

or

\[ \log i = 0.6047655 \log e + \bar{2}.2061708 \]

The  computed  values  by  this  equation  are  given  in  Table  79, 
page  329.  As  an  exercise  the  student  should  carry  this  problem 
through  by  averages,  using  four-place  logarithms,  and  compare  his 
results  with  ours. 
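For comparison with hand computation, the method-of-averages fit of the power law may be sketched in Python as below. This is an illustration under our own naming (`volts`, `amps`); negative logarithms are carried as ordinary negative numbers, so k appears as about -1.79383 where the text writes a negative characteristic with positive mantissa.

```python
import math

# Data of Table 77: terminal voltage e and current i.
volts = [2, 4, 8, 16, 32, 64, 128]
amps = [0.0245, 0.0370, 0.0570, 0.0855, 0.1295, 0.2000, 0.3035]

X = [math.log10(v) for v in volts]
Y = [math.log10(c) for c in amps]

# First four sets form the first group, last three the second group.
sx1, sy1, n1 = sum(X[:4]), sum(Y[:4]), 4
sx2, sy2, n2 = sum(X[4:]), sum(Y[4:]), 3

# Solve  b*sx + n*k = sy  for the two groups by Cramer's rule.
det = sx1 * n2 - sx2 * n1
b = (sy1 * n2 - sy2 * n1) / det
k = (sx1 * sy2 - sx2 * sy1) / det   # k = log a (an ordinary negative number)
a = 10 ** k

print(round(b, 7), round(k, 7), round(a, 6))
```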

We  shall  now  solve  this  exercise  by  the  method  of  least  squares. 
The  exercise  affords  an  excellent  opportunity  for  illustrating  a  short 
method.  Continuing  Table  77,  we  shall  employ  the  following  sub- 
stitutions: 

\[ x' = \frac{X - 1.2041200}{0.3010300} \qquad y' = Y + 1 \]

or

\[ X = 0.3010300x' + 1.2041200 \quad \text{and} \quad Y = y' - 1 \qquad (7) \]

Equation (6) will then become:

\[ y' - 1 = b(0.3010300x' + 1.2041200) + k \]

or

\[ y' = (0.30103b)x' + (1.20412b + k + 1) \]

or

\[ y' = mx' + k' \]

where \(m = 0.30103b\) and \(k' = 1.20412b + k + 1\).
From Section 59 (p. 218) we find m and k' by:

\[ m = \frac{n\sum x'y' - \sum x' \sum y'}{n\sum x'^2 - (\sum x')^2} \qquad k' = \frac{\sum x'^2 \sum y' - \sum x' \sum x'y'}{n\sum x'^2 - (\sum x')^2} \qquad (8) \]

We therefore continue Table 77 according to (7) and obtain
Table 78.

Table 78

 x'         y'          x'^2       x'y'

 -3     -0.6108339       9      1.8325017
 -2     -0.4317983       4      0.8635966
 -1     -0.2441251       1      0.2441251
  0     -0.0680339       0      0.0000000
  1      0.1122698       1      0.1122698
  2      0.3010300       4      0.6020600
  3      0.4821587       9      1.4464761

  0     -0.4593327      28      5.1010293

We can now find m and k':

\[ m = 0.30103b = \frac{7(5.1010293) - 0(-0.4593327)}{7(28) - 0^2} = 0.18217961 \]

\[ b = 0.6051875 \]

\[ k' = 1.20412b + k + 1 = \frac{28(-0.4593327) - 0(5.1010293)}{7(28) - 0^2} = -0.06561896 \]

\[ k = \log a = \bar{2}.2056627 \]
\[ a = 0.0160568 \]

Hence by the method of least squares the required relation is:

\[ i = 0.0160568e^{0.6051875} \]

\[ \log i = 0.6051875 \log e + \bar{2}.2056627 \]
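The short method with coded variables can be followed step by step in code. In the Python sketch below, which is an illustration with names of our own, the origin and step of the coding are taken as log 16 and log 2 at full precision (the text uses the seven-place values 1.2041200 and 0.3010300). Because the coded abscissas sum to zero, formulas (8) reduce to a quotient of two sums and a simple mean.

```python
import math

# Data of Table 77.
volts = [2, 4, 8, 16, 32, 64, 128]
amps = [0.0245, 0.0370, 0.0570, 0.0855, 0.1295, 0.2000, 0.3035]
X = [math.log10(v) for v in volts]
Y = [math.log10(c) for c in amps]

step = math.log10(2)      # common difference of the X values
origin = math.log10(16)   # middle X value

# Substitutions (7): x' runs over -3, ..., 3 and y' = Y + 1.
x_c = [round((x - origin) / step) for x in X]
y_c = [y + 1 for y in Y]
n = len(x_c)

# With sum(x') = 0, formulas (8) simplify as follows.
m = sum(x * y for x, y in zip(x_c, y_c)) / sum(x * x for x in x_c)
k_c = sum(y_c) / n

# Undo the coding: m = step*b and k' = origin*b + k + 1.
b = m / step
k = k_c - 1 - origin * b
a = 10 ** k

print(round(b, 7), round(a, 7))
```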
In the following table we show the computed values which have
been found from the equation determined by the method of averages
and from the equation determined by the method of least squares.

Table 79

  Observed Values                      Computed Values
                          By Least Squares          By Averages
  e        i             log i          i         log i          i

  2      0.0245       2̄.3878423     0.0244     2̄.3882234     0.0244
  4      0.0370       2̄.5700219     0.0372     2̄.5702759     0.0372
  8      0.0570       2̄.7520148     0.0565     2̄.7523285     0.0565
 16      0.0855       2̄.9343811     0.0860     2̄.9343810     0.0860
 32      0.1295       1̄.1165607     0.1308     1̄.1164336     0.1308
 64      0.2000       1̄.2987403     0.1990     1̄.2984862     0.1988
128      0.3035       1̄.4809199     0.3027     1̄.4805387     0.3023

EXERCISES 

1. Find an equation of the form \(Y = aX^b\) for the data:

X     5     7     9     15     20     30     40     50
Y     1     2     3      9     16     37     65    100

2. Find an equation of the form \(Y = aX^b\) for the data:

X      4       8      12      16      20      24
Y     2.9    23.0    77.8    184     360     622

3. Find an equation of the form \(Y = aX^b\) for the data:

X     10     20     30     40     50     60
Y     11     31     57     88    122    161

4. If Y is the diameter of a tree in inches at age X years, the relation
is \(Y = aX^b\). For the following data, find the equation of the given type:

X     19     58     114     140     181     229
Y      3      7    13.2    17.9    24.5      33

5. A body in sliding down a plane of length l feet attained a velocity
of V feet per second. Find the relation \(V = al^b\) for the data given in the
table:

l     19.9    45.1    67.5    94.4    109     126
V     10.1    15.2    18.6    22.0    23.6    25.4

6. The quantity of water, Q pounds, discharged per second from a
circular orifice in a tank, under a pressure head of h feet, was found by
experiment to result in the following data. Find the equation of the
type \(Q = ah^b\).

h     0.583    0.667    0.750    0.834    0.876    0.958    1.0
Q     7.00     7.60     7.94     8.42     8.68     9.04     9.34

7. At the following draughts, h feet, a particular vessel has the given
tonnage, T, in salt water. Find the equation of the type \(T = ah^b\).

h       15       12        9        6
T     2100     1510     1020      590

84. THE PARABOLA \(Y = aX^2 + bX + c\)

Due  to  the  fact  that  the  quadratic  parabola  possesses  a  three- 
constant  flexibility,  it  is  very  useful  in  graduating  statistical  data 
from  many  fields.  Three  constants  are  to  be  determined,  and  this 
can  be  done  by  (1)  the  method  of  selected  points,  (2)  the  method  of 
averages,  (3)  the  method  of  least  squares,  and  (4)  the  method  of  mo- 
ments. 

To  apply  the  method  of  selected  points,  we  draw  a  curve  among 
the  plotted  points  which  will  pass  as  near  as  possible  to  each  of  them. 
If  the  curve  happens  to  pass  through  three  of  the  plotted  points  or 
through  any  other  three  points  whose  coordinates  can  be  approxi- 
mately determined,  we  can  substitute  their  coordinates  in  the  given 
equation  and  solve  the  three  resulting  equations  for  a,  b,  and  c. 
Of  course  the  points  so  used  should  be  chosen  at  the  extreme  and 
middle  portions  of  the  data. 

As previously stated, the method of averages assumes that the
sum of the residuals is zero. That is:

\[ \sum (Y - aX^2 - bX - c) = 0 \]

or

\[ a\sum X^2 + b\sum X + nc = \sum Y \qquad (9) \]

In order to obtain three equations which can be solved for the
unknowns a, b, and c, we divide our data up into three sets. For each
set find n, \(\sum X\), \(\sum X^2\), and \(\sum Y\). Substitute in (9) and solve for a, b,
and c.
The method of least squares can be used to advantage with this
curve. By proceeding as in Section 59 (p. 217), we can find three
normal equations which can be solved for a, b, and c. Thus if \(\rho_i\) is
any residual:

\[ \rho_i = Y_i - aX_i^2 - bX_i - c \]

and

\[ \sum \rho_i^2 = \sum (Y_i - aX_i^2 - bX_i - c)^2 \]

The expression \(\sum \rho_i^2\) can be written as a quadratic in a, in b, and in c.
By imposing the condition that \(\sum \rho_i^2\) be a minimum upon each
quadratic we find the normal equations:¹

\[ a\sum X^2 + b\sum X + cn = \sum Y \]
\[ a\sum X^3 + b\sum X^2 + c\sum X = \sum XY \qquad (10) \]
\[ a\sum X^4 + b\sum X^3 + c\sum X^2 = \sum X^2 Y \]

Note that the first equation is merely the summation of the given
function; the second is the summation of X multiplied into the given
function, and the third is the summation of \(X^2\) multiplied into the
given function (see Exercise 16 at the end of this chapter).

If the values of X are in arithmetic progression (that is, if \(\Delta X\)
is constant) we can choose our units in such a manner that \(\sum X\)
and \(\sum X^3\) are zero. Further we may frequently use the relationships
in Exercises 2, page 10, and 20b, page 22, to determine \(\sum X^2\) and \(\sum X^4\).
By these artifices, the solution by least squares is not so laborious as
it might appear.

A  test  for  the  use  of  the  parabola  is  contained  in  the  general 
theorem  of  Section  80  (p.  310).  We  shall  quote  here  the  theorem  for 
our  special  case. 

Theorem: If, when \(\Delta X\) is constant, \(\Delta^2 Y\) is also constant, the relation
between the variables may be expressed by the equation:

\[ Y = aX^2 + bX + c \qquad (11) \]

Illustrative Example. The following table gives the modulus of
torsion of steel T, in kilograms per square centimeter, at various
temperatures θ in degrees Centigrade.

¹ A knowledge of the calculus would enable the student to write out such
normal equations very easily. By setting the partial derivatives of \(\sum \rho_i^2\) with
respect to c, b, and a each equal to zero, equations (10) are obtained.

Table 80

  θ        T        Δθ      ΔT      Δ²T

  0      8,290      20     -37      -1
 20      8,253      20     -38      -1
 40      8,215      20     -39      -1
 60      8,176      20     -40       0
 80      8,136      20     -40
100      8,096

We note that \(\Delta\theta\) is constant (= 20) and that \(\Delta^2 T\) is nearly
constant, hence by the preceding theorem the data follow the law:

\[ T = a\theta^2 + b\theta + c \qquad (12) \]
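The difference test is simple to mechanize. A minimal Python sketch follows, using the values of T from Table 80; the helper name `differences` is our own.

```python
def differences(values):
    """Return the list of successive differences of a sequence."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Modulus of torsion T at temperatures 0, 20, ..., 100 (Table 80).
T = [8290, 8253, 8215, 8176, 8136, 8096]

first = differences(T)        # Delta T
second = differences(first)   # second differences of T

print(first)    # [-37, -38, -39, -40, -40]
print(second)   # [-1, -1, -1, 0]
```

The test applies as written only because the temperatures are equally spaced.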

We shall use the method of least squares, and in order to shorten
the work we shall use the substitutions:

\[ X = \frac{\theta - 50}{10} \quad \text{and} \quad Y = T - 8200 \]

or

\[ \theta = 10X + 50 \quad \text{and} \quad T = Y + 8200 \]

Our equation (12) then becomes:

\[ Y = AX^2 + BX + C \]

where

\[ A = 100a \]
\[ B = 1000a + 10b = 10A + 10b \qquad (13) \]
\[ C = 2500a + 50b + c - 8200 \]
\[ C = 5B - 25A + c - 8200 \]

To form our normal equations we prepare the following table:
Table 81

  θ        T       X      Y     X^2    X^3     X^4      XY       X^2 Y    Computed T

  0      8,290    -5     90     25    -125     625     -450      2,250     8,290.1
 20      8,253    -3     53      9     -27      81     -159        477     8,252.9
 40      8,215    -1     15      1      -1       1      -15         15     8,214.9
 60      8,176     1    -24      1       1       1      -24        -24     8,176.0
 80      8,136     3    -64      9      27      81     -192       -576     8,136.3
100      8,096     5   -104     25     125     625     -520     -2,600     8,095.8

Total              0    -34     70       0   1,414   -1,360       -458

Substituting the proper sums in equations (10) we have the following
normal equations:

\[ 6C + 70A = -34 \]
\[ 70B = -1360 \]
\[ 70C + 1414A = -458 \]

Solving, we obtain

\[ A = -0.10267857 \qquad B = -19.428571 \qquad C = -4.46875 \]

from which it follows, using (13), that:

\[ a = -0.0010268 \qquad b = -1.8402 \qquad c = 8290.11 \]

Hence our equation is:

\[ T = -0.0010268\theta^2 - 1.8402\theta + 8290.11 \]

Assigning the given values to θ we obtain the computed values of
T that are found in the last column of Table 81.
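The parabola fit with coded variables may be retraced as follows. This Python sketch is an illustration with variable names of our own; it relies on the fact, noted earlier, that the sums of X and of X cubed vanish for the coded abscissas, so that the three normal equations separate into one equation for B and a pair of equations for A and C.

```python
# Data of Table 80: temperature and modulus of torsion.
theta = [0, 20, 40, 60, 80, 100]
T = [8290, 8253, 8215, 8176, 8136, 8096]

# Coded variables: X = (theta - 50)/10 and Y = T - 8200.
X = [(th - 50) / 10 for th in theta]
Y = [value - 8200 for value in T]
n = len(X)

s2 = sum(x ** 2 for x in X)
s4 = sum(x ** 4 for x in X)
sy = sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sx2y = sum(x * x * y for x, y in zip(X, Y))

# With sum(X) = sum(X^3) = 0 the normal equations reduce to
#   s2*A + n*C = sy,   s2*B = sxy,   s4*A + s2*C = sx2y.
B = sxy / s2
det = n * s4 - s2 ** 2
A = (n * sx2y - s2 * sy) / det
C = (s4 * sy - s2 * sx2y) / det

# Undo the coding by relations (13).
a = A / 100
b = (B - 10 * A) / 10
c = C - 5 * B + 25 * A + 8200

fitted = [a * th ** 2 + b * th + c for th in theta]
print(a, b, c)
print([round(value, 1) for value in fitted])
```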

85.   OTHER  USEFUL  CURVES 

In  this  chapter  we  have  attempted  to  introduce  the  student  to 
some  of  the  methods  of  fitting  simple  curves  to  observed  data.  We 
have  considered  in  great  detail  the  methods  of  fitting  the  straight 
line,  the  exponential  function,  the  power  function,  and  the  quadratic 
polynomial. 

We  shall  mention  now  with  less  detail  a  few  additional  well-known 
curves  that  are  frequently  found  useful. 

A. The Hyperbola

\[ Y = a + \frac{b}{X} \qquad (14) \]

This equation represents the hyperbola with the lines Y = a and
X = 0 as asymptotes. It can be written in the form:

\[ Y = a + b\left(\frac{1}{X}\right) \]

which is a straight line with slope b in the (1/X, Y) coordinates.

Hence we can state the test: If \(\Delta Y/\Delta(1/X)\) is constant, the data
obey the law given by (14).

It may be noted that if a = 0, we have as a special case the
well-known:

\[ XY = b \]

B. The Hyperbola

\[ Y = \frac{X}{a + bX} \qquad (15) \]

This equation represents the hyperbola with the lines a + bX = 0
and bY = 1 as asymptotes. We can write the equation in the form

\[ \frac{X}{Y} = a + bX \qquad (16) \]

which is that of a straight line with slope b in the (X, X/Y)
coordinates. Hence we can state the test: If \(\Delta(X/Y)/\Delta X\) is constant,
the data may be represented by (15).

The methods for determining the constants for (14) and (15)
should be evident.
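One way the constants of (14) and (15) might be determined is to fit a least-squares line in the rectified coordinates. The Python sketch below is only an illustration of that idea: the helper `fit_line` and both data sets are hypothetical, the data being generated exactly from \(Y = 2 + 3/X\) and \(Y = X/(1 + 2X)\) so that the recovered constants can be checked.

```python
def fit_line(us, vs):
    """Least-squares line v = slope*u + intercept; returns (slope, intercept)."""
    n = len(us)
    su, sv = sum(us), sum(vs)
    suu = sum(u * u for u in us)
    suv = sum(u * v for u, v in zip(us, vs))
    den = n * suu - su ** 2
    slope = (n * suv - su * sv) / den
    intercept = (suu * sv - su * suv) / den
    return slope, intercept


xs = [1.0, 2.0, 4.0, 5.0]

# Law (14): Y = a + b/X is a straight line in (1/X, Y).
ys_14 = [2 + 3 / x for x in xs]                      # hypothetical data
b_14, a_14 = fit_line([1 / x for x in xs], ys_14)

# Law (15): Y = X/(a + bX) is a straight line in (X, X/Y).
ys_15 = [x / (1 + 2 * x) for x in xs]                # hypothetical data
b_15, a_15 = fit_line(xs, [x / y for x, y in zip(xs, ys_15)])

print(a_14, b_14, a_15, b_15)
```

With observed data the fit minimizes residuals in the transformed coordinates, not in Y itself, so the result is an approximation to the best-fitting hyperbola.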

C. The Modified Exponential Function \(Y = a + bc^X\). The
following theorem may be used to determine if the modified
exponential law is applicable.

Theorem: If the values of X are in arithmetic progression and the
values of ΔY are in geometric progression, the data follow the law:

\[ Y = a + bc^X \qquad (17) \]

Table 82

X                          Y                                                      ΔY

X_1                        Y_1                                                    ΔY_1
X_2 = X_1 + ΔX             Y_2 = Y_1 + ΔY_1                                       ΔY_2 = rΔY_1
X_3 = X_1 + 2ΔX            Y_3 = Y_1 + ΔY_1 + rΔY_1                               . . .
. . .                      . . .                                                  ΔY_(n-1) = r^(n-2) ΔY_1
X_n = X_1 + (n - 1)ΔX      Y_n = Y_1 + ΔY_1 + rΔY_1 + . . . + r^(n-2) ΔY_1
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From  the  hypothesis  we  have,  since  the  values  of  AF  are  in  geo- 
metric progression: 

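$$Y_n = Y_1 + \Delta Y_1 + r\,\Delta Y_1 + r^{2}\,\Delta Y_1 + \cdots + r^{n-2}\,\Delta Y_1$$

Using the formula for the sum of a geometric progression, we have:

$$Y_n = Y_1 + \Delta Y_1\,\frac{1 - r^{n-1}}{1 - r}$$

or

$$Y_n = Y_1 + \frac{\Delta Y_1}{1 - r} - \frac{\Delta Y_1}{1 - r}\,r^{n-1} \tag{18}$$

Further, since the values of X are in arithmetic progression, we
have:

$$X_n = X_1 + (n - 1)\,\Delta X$$

or

$$n - 1 = \frac{X_n - X_1}{\Delta X}$$

Substituting this value of (n - 1) in (18) we have:

$$Y_n = Y_1 + \frac{\Delta Y_1}{1 - r} - \frac{\Delta Y_1}{1 - r}\,r^{(X_n - X_1)/\Delta X}$$

or

$$Y_n = a + bc^{X_n}$$

where

$$a = Y_1 + \frac{\Delta Y_1}{1 - r}, \qquad b = -\frac{\Delta Y_1}{1 - r}\,r^{-X_1/\Delta X}, \qquad c = r^{1/\Delta X}$$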

To  determine  the  constants  of  this  equation  we  shall  employ  the 
method  of  selected  points.  We  draw  the  best-fitting  curve  among 
the  points.  We  now  choose  three  points  on  the  curve  whose  co- 
ordinates are  known  —  or  can  be  estimated  —  and  whose  abscissas 
are  in  arithmetic  progression.  We  can  form  three  equations  by 
substituting  the  coordinates  of  the  selected  points  in  (17),  and  solve 
for  the  unknowns. 

Exercise. A curve of the type Y = a + bc^X passes through the three
points (1, 10), (3, 28), and (5, 100). What is its equation?

For an illustrative example, consider the data of Table 83.

Table 83

 X      Y      ΔY    Ratio of Any ΔY to Preceding
 4    20.1    7.7
 5    27.8    7.2    0.93
 6    35.0    6.5    0.90
 7    41.5    6.1    0.93
 8    47.6    5.4    0.88
 9    53.0    5.1    0.94
10    58.1    4.6    0.90
11    62.7    4.1    0.89
12    66.8    3.8    0.92
13    70.6

We note that the values of X are in arithmetic progression, that
the values of ΔY are approximately in geometric progression, and
conclude that our data follow the law:

$$Y = a + bc^{X}$$
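The test applied to Table 83 can be sketched in a few lines: form the first differences of Y, then the ratio of each difference to the one before it, and see whether the ratios are roughly constant. The variable names are ours, and how much scatter in the ratios is tolerable remains a matter of judgment.

```python
# Data of Table 83.
x = list(range(4, 14))
y = [20.1, 27.8, 35.0, 41.5, 47.6, 53.0, 58.1, 62.7, 66.8, 70.6]

# First differences of X and of Y.
dx = [x2 - x1 for x1, x2 in zip(x, x[1:])]
dy = [y2 - y1 for y1, y2 in zip(y, y[1:])]

# Ratio of each difference of Y to the preceding one.
ratios = [d2 / d1 for d1, d2 in zip(dy, dy[1:])]

x_is_arithmetic = len(set(dx)) == 1
spread = max(ratios) - min(ratios)

print("X in arithmetic progression:", x_is_arithmetic)
print("ratios:", [round(r, 3) for r in ratios])
print("spread of ratios:", round(spread, 3))
```

A small spread about a common value suggests that the modified exponential law is worth trying.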

To determine the constants, assume that the curve passes through
the points (4, 20.1), (8, 47.6), and (12, 66.8). We then have the
equations:

$$a + bc^{4} = 20.1$$
$$a + bc^{8} = 47.6$$
$$a + bc^{12} = 66.8$$

Then

$$bc^{4}(c^{4} - 1) = 47.6 - 20.1 = 27.5$$
$$bc^{8}(c^{4} - 1) = 66.8 - 47.6 = 19.2$$

and by division we obtain:

$$c^{4} = 0.6982$$
$$c = 0.9141$$

By substitution we have:

$$b(0.6982)(0.6982 - 1) = 27.5$$

and

$$b = -130.5$$

Now a is easily found, for:

$$a + (-130.5)(0.6982) = 20.1$$

and

$$a = 111.2$$

Hence our equation is:

$$Y = 111.2 - 130.5(0.9141)^{X}$$

Other  selections  of  the  points  will  give  slightly  different  values. 
The  computed  values  and  the  residuals  may  now  be  found. 
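The method of selected points for the modified exponential can be sketched as follows. The function name is our own; the arithmetic is the elimination carried out in the text, and the data are those of Table 83.

```python
def modified_exponential_through(p1, p2, p3):
    """Constants (a, b, c) of Y = a + b*c**X through three points whose
    abscissas are in arithmetic progression."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    h = x2 - x1
    if abs((x3 - x2) - h) > 1e-12:
        raise ValueError("abscissas must be in arithmetic progression")
    q = (y3 - y2) / (y2 - y1)  # ratio of the two differences, equal to c**h
    if q <= 0 or q == 1:
        raise ValueError("differences do not suit a modified exponential")
    c = q ** (1.0 / h)
    b = (y2 - y1) / (c ** x1 * (q - 1.0))
    a = y1 - b * c ** x1
    return a, b, c


# Data of Table 83 and the three selected points used in the text.
xs = list(range(4, 14))
ys = [20.1, 27.8, 35.0, 41.5, 47.6, 53.0, 58.1, 62.7, 66.8, 70.6]
a, b, c = modified_exponential_through((4, 20.1), (8, 47.6), (12, 66.8))

computed = [a + b * c ** x for x in xs]
residuals = [obs - comp for obs, comp in zip(ys, computed)]

print(round(a, 1), round(b, 1), round(c, 4))
for x, obs, comp, res in zip(xs, ys, computed, residuals):
    print(x, obs, round(comp, 2), round(res, 2))
```

The residuals at the three selected points are zero apart from rounding, since the curve is forced through them; the remaining residuals show how well the other observations are reproduced.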

D. The Modified Power Function Y = c + aX^b. A test for the
applicability of this law is contained in the

Theorem: If the values of X form a geometric progression and
the values of ΔY also form a geometric progression, then the data obey
the law:

$$Y = c + aX^{b} \tag{19}$$


Table 84

X                       Y                                                        ΔY
X_1                     Y_1                                                      ΔY_1
X_2 = rX_1              Y_2 = Y_1 + ΔY_1                                         ΔY_2 = RΔY_1
X_3 = r^2 X_1           Y_3 = Y_1 + ΔY_1 + RΔY_1                                 . . .
. . .                   . . .                                                    ΔY_(n-1) = R^(n-2) ΔY_1
X_n = r^(n-1) X_1       Y_n = Y_1 + ΔY_1 + RΔY_1 + . . . + R^(n-2) ΔY_1

From the hypothesis we have:

$$X_n = r^{n-1}X_1 \quad \text{or} \quad n - 1 = \frac{\log X_n - \log X_1}{\log r}$$

and

$$Y_n = Y_1 + \Delta Y_1\,[1 + R + R^{2} + \cdots + R^{n-2}]$$

or

$$Y_n = Y_1 + \Delta Y_1\,\frac{1 - R^{n-1}}{1 - R}$$

The  remainder  of  the  proof  easily  follows,  and  we  leave  its  com- 
pletion to  the  reader. 

As in the modified exponential, we determine the constants by
the method of selected points, but in this case the abscissas should
be chosen in geometric progression.
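A minimal sketch of the method of selected points for the modified power function follows. The function name and the three sample points are hypothetical; the algebra mirrors that used for the modified exponential, with the ratio of the abscissas taking the place of their common difference.

```python
import math


def modified_power_through(p1, p2, p3):
    """Constants (c, a, b) of Y = c + a*X**b through three points whose
    abscissas are in geometric progression."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    r = x2 / x1
    if abs(x3 / x2 - r) > 1e-12 * abs(r):
        raise ValueError("abscissas must be in geometric progression")
    big_r = (y3 - y2) / (y2 - y1)  # ratio of the differences, equal to r**b
    if big_r <= 0 or big_r == 1:
        raise ValueError("differences do not suit a modified power function")
    b = math.log(big_r) / math.log(r)
    a = (y2 - y1) / (x1 ** b * (big_r - 1.0))
    c = y1 - a * x1 ** b
    return c, a, b


# Hypothetical points lying on Y = 5 + 2*X**1.5, with abscissas 1, 4, 16.
c, a, b = modified_power_through((1, 7), (4, 21), (16, 133))
print(round(c, 6), round(a, 6), round(b, 6))
```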

86.   LIMITATIONS  OF  EMPIRICAL  EQUATIONS 

In the preceding pages of this chapter we have been concerned
with two fundamental questions that relate to empirical equations:
first, what type of equation should be selected to describe the data;
and second, once the type of equation has been decided upon, how can
the constants be determined? Once the first question is answered,
the second presents no great difficulty.

Once  the  equation  for  the  data  has  been  determined,  we  have  an 
expression  that  may  be  used,  within  certain  limits,  to  estimate  values 
of  the  dependent  variable  and  thus  to  compare  values  on  the  curve 
with  observed  values.  Further,  if  a  criterion  of  goodness  of  fit  is 
desired,  we  may  turn  to  the  sum  of  the  squares  of  the  residuals. 

To  assist  in  determining  the  type  of  equation  to  be  selected  we 
have  devised  tests  to  apply  to  the  observations.  The  illustrative 
examples  that  we  have  solved  have  enjoyed  a  singular  peculiarity; 
they  have  presented  data  for  which  the  tests  were  closely  satisfied. 
In  general,  the  data  have  come  from  the  laboratories  of  the  physical 
sciences  where  it  is  possible  to  restrict  the  problem  to  a  study  of  the 
variables  in  question,  and  to  control  or  eliminate  outside  influences. 
There have been internal as well as mathematical reasons for selecting
an equation of given type and thus our empirical equations
have been "true relations" between the variables in question.

When a physicist is analyzing a set of distance-time data of the
flight of a projectile, he will know for internal reasons that his curve
is a second degree parabola D = AT^2 + BT + C. Similarly, a
chemist in analyzing pressure-volume data would likely choose
P = AV^B. As a result of slow and painful research, the scientist
learns how certain phenomena behave. It frequently occurs that a
study of empirical data leads to a formulation and discovery of relationships
that the investigator had not been able to formulate from
analytical considerations. A classic example of this method was the
discovery and formulation of Kepler's Laws which explain the motions
of the planets. These laws were formulated by Johann Kepler
(1571-1630) after a study of a tremendous quantity of observed data
collected over a number of years by the brilliant astronomer, Tycho
Brahe (1546-1601). The truths hidden in the data were not revealed
to the observer, Brahe, but when Kepler analyzed the data he saw in
them relationships that he formulated into what are known as
Kepler's Laws. Science is replete with similar examples.

When one moves outside the realm of physical science, he has
difficulty in finding an equation that explains and expresses a "true
relationship." Internal evidence is lacking. Too many uncontrollable
influences are present that cannot be eliminated, and thus our
data may not lead to an analytical formulation of an inherent
relationship. In biological, educational, economic, and social relationships
our knowledge is too limited to enable us to say why a
relationship exists. The best we can do in these fields is to find
a functional relationship between the variables in question for the
particular data at hand. Generally, we cannot explain the why of
the relationship. In such cases the data obviously may not reveal
that a certain type of equation is indicated. Sometimes experience
comes to the assistance of the investigator; otherwise he does what
all of us do, namely, the best he can.

Usually the purpose of this functional relationship is to estimate
sufficiently well the values of one variable from known values of
another, and frequently this purpose can be accomplished by using
more than one type of equation. In fact, we can establish the functional
relationship without an equation at all. If to each value of X there
is determined one or more values of Y, then Y is a function of X.
We may determine the values of Y from a graph or a table of values,
and that is all that is really necessary. However, much is gained
if we can obtain a summarizing expression in the form of an
equation.

We then face the practical problem of finding a functional relationship.
If we choose to find an equation, the curve may fit poorly or
closely. When the data are such that a careful analysis is warranted,
they should be subjected to a careful analysis; however, should they
not warrant a careful analysis, it is the height of absurdity to subject
them to such a treatment. The investigator must determine the
type of treatment the data merit.

In  our  previous  sections  we  have  discussed  methods  of  dealing 
with  precise  measurements  in  a  precise  manner.  In  fact,  we  have 
frequently  used  eight-place  logarithms  in  our  computations  in  order 
that  our  results  might  be  the  more  precise.  In  the  next  section  we 
discuss  methods  of  dealing  with  data  that  may  not  merit  a  careful 
analysis. 

87.   GRAPHICAL  METHODS  IN  TREND  ANALYSIS 

Frequently  workers  in  practical  statistics  are  confronted  with  data 
that  do  not  warrant  a  careful  algebraical  and  numerical  analysis. 
Rough  approximations  may  be  sufficiently  accurate  for  the  investi- 
gator's needs.  In  such  cases  he  usually  resorts  to  the  use  of  graphical 
methods.  Especially  are  graphs  widely  employed  in  trend  analysis. 
Not only may the graph be used to give a clue to the equation
of  the  curve  that  may  be  used  to  represent  the  trend;  it  may  even 
be  used  to  determine  the  unknown  constants  that  appear  in  the 
equation  that  is  selected. 

We  are  familiar  with  graphs  made  on  the  conventional  cross- 
section  coordinate  paper.  On  this  paper  a  given  distance  in  any 
direction,  when  applied  to  a  given  problem,  always  represents  a 
constant quantity. Such paper may be specifically called "arithmetic
paper," and the uniform scale an "arithmetic scale."

We may, however, develop scales on which equal distances do
not always represent equal magnitudes. A very common and widely
used scale of this kind is the "logarithmic scale" on which equal
distances represent equal proportional or percentage changes. In
this scale the points correspond to the logarithms of numbers. By
placing the natural numbers, N, and their logarithms, log N, into
correspondence, it is noted that the logarithms are spaced uniformly
along the line while the integers are spaced non-uniformly.

The  scale  from  1  to  10  as  shown  on  the  line  AB  constitutes  a 
cycle.  Any  number  on  the  scale,  say  X,  corresponds  to  log  X.  That 
is,  the  logarithmic  scale  serves  the  purpose  of  finding  the  logarithms. 
By  prolonging  the  line  AB  and  repeating  the  scale,  we  may  con- 
struct a  segment  of  two  cycles. 
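The spacing of the integers on one cycle of a logarithmic scale can be sketched numerically. The function name and its parameters are our own; the cycle is taken to have unit length, and the value assigned to the initial point A is assumed to be 1 unless stated otherwise.

```python
import math


def log_scale_position(value, start=1.0, cycle_length=1.0):
    """Distance from the initial point A (labelled `start`) at which
    `value` is marked on a logarithmic scale."""
    return cycle_length * math.log10(value / start)


# Positions of the integers 1 to 10 along one cycle AB.
positions = {n: round(log_scale_position(n), 4) for n in range(1, 11)}
for n, pos in positions.items():
    print(n, pos)

# Equal distances correspond to equal ratios: 1 to 2 spans the same
# length as 2 to 4 or 5 to 10.
print(round(log_scale_position(4) - log_scale_position(2), 4))
print(round(log_scale_position(10) - log_scale_position(5), 4))
```

Repeating the scale beyond B simply adds one cycle length for each factor of 10.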

It is customary to assign a value to the initial point A. It may
be any number greater than zero. The value to be assigned is
determined by the problem at hand. The value placed at the end
of the cycle, B, is 10 times the value assigned to the point A. Thus,
the numbers along the following scale, AB, serve as illustrations.

A. Arithmetic Paper. As an illustration of the use of the graphical
method in determining the straight-line trend we shall consider
the data that were given in Exercise 3, page 315.

Table 85. Annual Earnings of the Associated Gas
and Electric System, 1920-1928

Year    Net Earnings              Year    Net Earnings
        (millions of dollars)             (millions of dollars)
1920    13.4                      1925    29.5
1921    16.2                      1926    33.5
1922    19.2                      1927    37.8
1923    22.7                      1928    40.6
1924    25.1                      1929

We plot the data carefully on arithmetic coordinate paper with
X = 0 at 1920 [Figure 44]. The observed points are indicated by
the small crosses. We then sketch in "by sight" the line of trend.
It cuts the Y-axis at 12.5. By using this point and the point (8,
40.5) as two known points on the line, we obtain

$$m = \frac{40.5 - 12.5}{8 - 0} = 3.5$$

Hence we have the equation of the line Y = 3.5X + 12.5.

Figure 44


In  general  we  proceed  as  follows:  We  plot  the  data  carefully  on 
arithmetic  paper.  Next,  we  draw  in  by  sight  the  trend  line.  Then 
selecting  two  widely  separated  points  A  and  B  on  the  line,  we 
evaluate  the  ratio  of  the  difference  in  the  ordinates  to  the  difference 
of  the  abscissas  of  the  two  points.  This  gives  us  the  slope,  m,  of  the 
trend  line.  Using  this  slope  with  some  point  on  the  line  whose  co- 
ordinates are  read  from  the  graph,  we  can  find  from  the  point-slope 
form 

$$Y - Y_1 = m(X - X_1)$$

the equation of the trend. Of course if the Y-intercept can be
determined from the graph, we may use the slope-intercept form


Y  =  mX  +  b 

and  thus  find  the  equation  of  the  trend  line. 
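The arithmetic of the graphical procedure can be sketched as follows, using the two points read from the trend line of the Associated Gas and Electric data, (0, 12.5) and (8, 40.5). The function name is our own, and the points themselves depend on how the line was drawn by sight.

```python
def line_through(p, q):
    """Slope m and Y-intercept b of the straight line through two points."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)
    b = y1 - m * x1
    return m, b


# Two widely separated points read from the trend line drawn by sight.
m, b = line_through((0, 12.5), (8, 40.5))

# Observed earnings of Table 85, with X = 0 at 1920.
earnings = [13.4, 16.2, 19.2, 22.7, 25.1, 29.5, 33.5, 37.8, 40.6]
trend = [m * x + b for x in range(len(earnings))]

print("Y =", m, "X +", b)
for x, (obs, est) in enumerate(zip(earnings, trend)):
    print(1920 + x, obs, est)
```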

Obviously  this  same  method  may  be  employed  for  parabolic, 
exponential,  or  other  types  of  trend.  We  choose  points  equal  in 
number  to  the  number  of  constants  in  the  equation,  substitute  the 
coordinates  in  the  chosen  equation,  and  solve  for  the  unknowns. 
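For a parabolic trend Y = AX^2 + BX + C, three selected points give three equations in the three constants. One way to solve them, sketched below, uses divided differences; the function name and the sample points are hypothetical, chosen only so that the result can be checked by hand.

```python
def parabola_through(p1, p2, p3):
    """Constants (A, B, C) of Y = A*X**2 + B*X + C through three points
    with distinct abscissas."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d12 = (y2 - y1) / (x2 - x1)   # slope between first and second point
    d23 = (y3 - y2) / (x3 - x2)   # slope between second and third point
    A = (d23 - d12) / (x3 - x1)   # second divided difference
    B = d12 - A * (x1 + x2)
    C = y1 - A * x1 ** 2 - B * x1
    return A, B, C


# Hypothetical points lying on Y = 2*X**2 + 1.
A, B, C = parabola_through((0, 1), (1, 3), (2, 9))
print(A, B, C)
```

In practice the three points would be read from a smooth curve sketched through the plotted data, not taken from the observations themselves.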

B. Semi-logarithmic Paper. Logarithmic scales may be used
on the axes of coordinate paper. If the scale on one of the axes is
logarithmic and on the other is arithmetic, the paper is called
semi-logarithmic paper. This type of paper, usually with three cycles,
can be purchased at stationery stores. It is used by statisticians in
studying the growth of populations and bank clearings; in short, in
studying data that may follow the exponential function Y = ab^X.

The following theorems contain the gist of the theory.

Theorem 1. The graph of the exponential function Y = ab^X
plotted on semi-logarithmic paper is a straight line whose slope is
log b and whose intercept on the non-uniform scale is a.

Proof: From

$$Y = ab^{X}$$

we have, taking logarithms,

$$\log Y = (\log b)X + \log a$$

which is an equation of the first degree in the (X, log Y) coordinates,
and consequently represents a straight line. The slope is log b and
the vertical intercept is a on the log Y-axis. That is, if the points
(X, Y) plotted on uniform coordinate paper fall upon the curve
Y = ab^X, when plotted on semi-logarithmic paper they fall upon the
straight line log Y = (log b)X + log a.

Conversely, if the points (X, Y) when plotted on semi-logarithmic
paper lie on a straight line with slope log b and with intercept a on
the non-uniformly scaled vertical axis, the data follow the exponential
law Y = ab^X.

We  shall  leave  the  proof  as  an  exercise  for  the  reader. 
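Theorem 1 can be illustrated numerically with the function Y = 2^X: if the logarithms of Y advance by a constant amount for each unit step in X, the points lie on a straight line in the (X, log Y) coordinates. The sketch below uses common logarithms; the variable names are ours.

```python
import math

# Table of values for Y = 2**X.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 ** x for x in xs]

# On semi-logarithmic paper the vertical position of a point is log Y.
log_ys = [math.log10(y) for y in ys]

# Rise in log Y for each unit step in X; constant for a straight line.
steps = [l2 - l1 for l1, l2 in zip(log_ys, log_ys[1:])]

slope = steps[0]               # should equal log 2
intercept = 10 ** log_ys[0]    # reading on the non-uniform scale at X = 0

print([round(s, 4) for s in steps])
print(round(slope, 4), intercept)
```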

Example 1. Draw the graph of the curve Y = 2^X on arithmetic paper
and on semi-logarithmic paper.

We prepare a table of values and plot the points as indicated.

Figure 45(a)    Figure 45(b)

It is noted that the equation plots into a curve on the arithmetic paper
and into a straight line on the semi-logarithmic paper. The equation of
the straight line in the (X, log Y) coordinates may be written

$$\log Y = (\log 2)X + \log 1$$

in which the slope is log 2 and the vertical intercept is 1.

EXERCISES

1. Find the equations of the straight lines in Fig. 46 in semi-logarithmic
form. What are the corresponding equations in exponential form?

2.  Plot  the  following  equations  on  semi-logarithmic  paper. 

(a)  Y = 2(3)^X    (c)  log Y = 0.5X + log 3

(b)  Y  =  2(10)''^'  (d)  Y  =  3(10)-''^' 

3. If $10 is invested at 5 per cent compounded annually the amount Y
at the end of X years is given by Y = 10(1.05)^X. Plot this curve on
semi-logarithmic paper.

Let  us  next  employ  semi-logarithmic  paper  to  determine  graphi- 
cally the  approximate  exponential  equation  that  obtains  for  a  mass 
of  empirical  data.  We  illustrate  the  procedure  in  Example  2. 

Example 2. Find graphically the exponential trend of the gross earnings
in millions of dollars of all Bell telephone companies in the United States
as given in the accompanying table.

Table 86

Year    Earnings
1921     521
1922     564
1923     623
1924     678
1925     761
1926     845
1927     917
1928    1003

Figure 47

We choose X = 0 at 1921. We note that the data, when plotted on
semi-logarithmic paper, lie along the line BC which we draw by sight.
For this line a = 520. Taking the point (7, 1000) as a second point on
the line we have

$$\text{slope} = \log b = \frac{\log 1000 - \log 520}{7 - 0} = \frac{3.0000 - 2.7160}{7} = 0.0406$$

$$b = 1.1 \text{ approximately}$$

The equation of the straight line in semi-logarithmic coordinates is
therefore

$$\log Y = 0.0406X + \log 520$$

and the corresponding exponential equation is

$$Y = 520(1.1)^{X}$$
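The computation for the Bell telephone earnings can be checked with a short sketch. It takes the two points read from the line drawn by sight, (0, 520) and (7, 1000), and uses common logarithms throughout; the variable names are our own.

```python
import math

# Points read from the line drawn by sight on semi-logarithmic paper.
a = 520.0                      # intercept on the logarithmic scale at X = 0
x2, y2 = 7, 1000.0             # second point on the line

slope = (math.log10(y2) - math.log10(a)) / (x2 - 0)   # this is log b
b = 10 ** slope

# Observed earnings of Table 86, with X = 0 at 1921.
earnings = [521, 564, 623, 678, 761, 845, 917, 1003]
fitted = [a * b ** x for x in range(len(earnings))]

print(round(slope, 4), round(b, 3))
for x, (obs, est) in enumerate(zip(earnings, fitted)):
    print(1921 + x, obs, round(est, 1))
```

Rounding b to 1.1, as the text does, raises the later computed values slightly; carrying more decimal places keeps the curve through the second selected point.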


EXERCISES 

1. The registration (in millions) of motor vehicles in the United States
in the given years is shown by the following table. [Statistical Abstract
of the U.S., 1930, p. 385.] Using semi-logarithmic paper, find an exponential
function that will approximately fit the data.

Year    Registration    Year    Registration
1917     5.0            1922    12.2
1918     6.1            1923    15.1
1919     7.6            1924    17.6
1920     9.2            1925    19.9
1921    10.5            1926    22.0

2. Use semi-logarithmic paper to fit an exponential curve to the following
data which give the average number of shares (in millions) sold
on the New York Stock Exchange from 1919 to 1929 inclusive.


Year    Sales    Year    Sales
1919    26.07    1925    37.69
1920    18.73    1926    37.42
1921    14.30    1927    48.08
1922    21.73    1928    76.71
1923    19.77    1929    93.75
1924    23.50

3. Use semi-logarithmic paper to fit an exponential curve to the following
data which give the production (in millions of barrels) of petroleum in
the United States 1920-1929. [Statistical Abstract of the United States,
1936, p. 723.]

Year    Production    Year    Production
1920    443.0         1925     763.7
1921    472.2         1926     770.9
1922    557.5         1927     901.1
1923    732.4         1928     901.5
1924    713.9         1929    1007.3

C.  Logarithmic  Paper.  Thus  far  we  have  used  two  types  of 
coordinate  paper  in  our  work,  arithmetic  and  semi-logarithmic.  In 
the  arithmetic  paper,  the  scale  along  both  axes  is  the  natural  scale. 
The  semi-logarithmic  paper  has  the  natural  scale  along  the  axis  of 
abscissas  and  a  logarithmic  scale  along  the  axis  of  ordinates. 

Another  useful  type  of  paper  is  logarithmic  paper.  This  paper  is 
ruled  with  logarithmic  scales  both  horizontally  and  vertically.  It 
is frequently called double logarithmic or log-log paper. When a
point (X, Y) is plotted on log-log paper, its actual distances from
the reference lines are proportional to log X and log Y. In other
words, in graphing pairs of numbers on logarithmic paper we really
graph the logarithms of the numbers. The logarithmic paper serves
the purpose of finding the logarithms of the numbers. The effect
of this is to tone down the contrasts. For example,

log 1000 is only 3, and log 0.0001 is -4.

Double logarithmic paper is very useful in studying the power
function

$$Y = aX^{b}$$

where a and b are constants. This is due to the fact that the graph
of the power function on logarithmic paper is a straight line. For
we have the

Theorem: The graph of the power function

$$Y = aX^{b}$$

plotted on logarithmic paper is the straight line whose slope is b
and whose intercept on the Y-axis is a.

Proof: Taking logarithms of the above equation, we have

$$\log Y = b \log X + \log a$$

which is an equation, in logarithmic coordinates, of a straight line
with slope b and Y-intercept a.

Conversely, if the (X, Y) data when plotted on logarithmic paper
give a straight line with slope b and Y-intercept a, the data follow
the law

$$Y = aX^{b}$$

Proof: The (log X, log Y) relation is linear. Hence

$$\log Y = b \log X + \log a$$

which can be immediately reduced to

$$Y = aX^{b}$$

Example 1. Draw the graph of Y = 3X^2 on logarithmic paper.

Since the graph is a straight line, we need but two points, say (1, 3) and
(2, 12), to determine the constants. These two points determine the line
AB of Figure 48.

The slope b is given by

$$b = \frac{\log 12 - \log 3}{\log 2 - \log 1} = \frac{\log 4}{\log 2} = \frac{2 \log 2}{\log 2} = 2$$

From the figure a = 3, hence the log-log equation is

$$\log Y = 2 \log X + \log 3$$
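The recovery of the constants of a power function from two points on its log-log line can be sketched as follows. The function name is our own; the two points are those of Example 1, and the intercept a is taken as the value of Y at X = 1, where log X = 0.

```python
import math


def power_law_through(p, q):
    """Constants (a, b) of Y = a*X**b from two points on its log-log line."""
    (x1, y1), (x2, y2) = p, q
    b = (math.log10(y2) - math.log10(y1)) / (math.log10(x2) - math.log10(x1))
    a = y1 / x1 ** b   # value of Y where X = 1
    return a, b


a, b = power_law_through((1, 3), (2, 12))
print(round(a, 6), round(b, 6))

# Any other pair of points on the same curve gives the same constants.
a_alt, b_alt = power_law_through((2, 12), (5, 75))
print(round(a_alt, 6), round(b_alt, 6))
```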


Figure 49


Example 2. Find the equation of the curve that graphs into the line
marked c of Figure 49.

We have

$$\text{slope} = \frac{\log 8 - \log 4}{\log 10 - \log 1} = \frac{\log 2}{1} = \log 2$$

$$b = \log 2$$

From the figure a = 4, and the log-log equation is

$$\log Y = (\log 2) \log X + \log 4$$

Using the properties of logarithms we obtain

$$Y = 4X^{\log 2}$$


Example 3. Using log-log paper find the equation that approximately
fits the data:

X     1     3     5     7    10    20    40    60   100
Y    25    45    60    70    90   130   190   240   300

Figure 50

In solving this problem we find it necessary to use two-cycle log-log
paper. We indicate the points by small crosses. Since the points lie
approximately upon the straight line AB, the data may be approximately
represented by Y = aX^b. Assume that the line passes through the points
A(1, 25) and B(100, 300).
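The constants implied by the line through A(1, 25) and B(100, 300) can be computed directly and the resulting curve compared with the plotted observations. In the sketch below the variable names are ours, and the list of observations includes only the pairs whose X values are legible in the table.

```python
import math

# End points of the line AB drawn on two-cycle log-log paper.
xa, ya = 1.0, 25.0
xb, yb = 100.0, 300.0

b = (math.log10(yb) - math.log10(ya)) / (math.log10(xb) - math.log10(xa))
a = ya / xa ** b   # intercept: the value of Y at X = 1

# Observations of the example (X, Y).
data = [(1, 25), (3, 45), (5, 60), (7, 70), (10, 90),
        (20, 130), (40, 190), (60, 240)]

# Relative departure of each observation from the curve Y = a*X**b.
rel_errors = [abs(y - a * x ** b) / y for x, y in data]
max_rel_error = max(rel_errors)

print(round(a, 2), round(b, 2))
for (x, y), err in zip(data, rel_errors):
    print(x, y, round(a * x ** b, 1), round(100 * err, 1), "per cent")
```

The departures are a few per cent at most, which is about what a line fitted by sight can be expected to give.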

$$b = \text{slope} = \frac{\log 300 - \log 25}{\log 100 - \log 1} = \frac{\log 12}{2} = \frac{1.0792}{2} = 0.54$$

From Figure 50 the Y-intercept = a = 25. Hence the log-log equation is

$$\log Y = 0.54 \log X + \log 25$$

from which we immediately obtain

$$Y = 25X^{0.54}$$

EXERCISES

1. Find the log X, log Y and the X, Y equations of the lines a, b, d, e, f,
and g of Figure 49.

2. Use log-log paper to determine the log X, log Y and the X, Y equations
for the data given in the table:

X      4      8     12      16      20      24
Y    2.9   23.0   77.8   184.0   360.0   622.1

3. Use log-log paper to determine the approximate X, Y relation for
the data given in the table:

X    10    20    30    40     50     60
Y    11    31    57    88    122    161

4. Plot the following data on two-cycle log-log paper and determine
the log X, log Y and the X, Y relations.

X    5    7    9    15    20    30    40     50
Y    1    2    3     9    16    37    65    100

5. Solve Exercise 4 above using the method of averages for the (log X,
log Y) straight line.

6. Use semi-log paper to find the X, log Y and the X, Y equations for
the data:

X    1.6    3.1    4.7     6.3     7.9     9.4    11.0
Y    5.4    7.2    9.6    12.8    17.1    22.9    30.8

7. The number of bacteria in a given culture t hours after they were
first observed was found to be that given by the table. Using semi-log
paper find N in terms of t.

t      0      1      2      3      4       5       6
N    125    209    340    561    924    1525    2512

8. The number N of bacteria in a culture at the end of t hours is shown
by the following table. Use semi-log paper to find N in terms of t.

t      0      1      2      3      4       5       6
N    100    162    265    450    742    1230    2020



9. The annual expenditure of the United States Government (in millions
of dollars) has increased as in the table. Use semi-log paper to determine
the appropriate law. Would you advise using this law to extrapolate for
the expenditure in 1918?

Year    Expenditure    Year    Expenditure
1840     24            1880    265
1850     41            1890    298
1860     63            1900    488
1870    294            1910    660

10. The total assets (in billions of dollars) of Building and Loan Associations
in the United States for the given years are shown in the following
table. Use semi-log paper to find the X, log Y and the X, Y equations.

Year    X    Assets Y    Year    X    Assets Y
1920    0    2.52        1925    5    5.51
1921    1    2.89        1926    6    6.33
1922    2    3.34        1927    7    7.18
1923    3    3.94        1928    8    8.02
1924    4    4.77        1929    9    8.70

11. The following table gives the average monthly imports of wood
pulp (millions of short tons) into the United States for the given years.
Choose X = 0 at 1926 and find the straight-line equation by selected
points. Extrapolate for the years 1931, 1932, and 1933. The actual
imports these years were 133.0, 123.5, and 161.8 short tons.

Year    Imports    Year    Imports
1922    105        1927    140
1923    115        1928    147
1924    127        1929    157
1925    139        1930    152
1926    145

12. The following table gives the production of women's shoes (in
millions of pairs) for the given years. Plot the data on semi-logarithmic
paper and determine the X, log Y and the X, Y relations using X = 0
at 1931. Find the extrapolated value for 1940. The actual value was
12.5 million pairs.

Year    X    Production    Year    X    Production
1931    0     9.4          1936    5    13.5
1932    1     9.5          1937    6    12.5
1933    2    10.9          1938    7    12.3
1934    3    11.1          1939    8    14.0
1935    4    12.1          1940    9

13. The following table gives (in millions of pounds) the domestic
consumption of rayon in the United States from 1920 to 1936. Plot on
semi-logarithmic paper with X = 0 at 1920, and find graphically the X,
log Y and the X, Y equations. Find the extrapolated value for 1937.
The actual value was 261.2 million pounds.

Year    X    Consumption    Year     X    Consumption
1920    0      9            1929     9    131
1921    1     20            1930    10    118
1922    2     25            1931    11    157
1923    3     33            1932    12    152
1924    4     42            1933    13    212
1925    5     58            1934    14    195
1926    6     61            1935    15    253
1927    7    100            1936    16    298
1928    8    100            1937    17

14. Plot the data of Exercise 13 above on arithmetic paper and use
the method of selected points to find the equation of the parabola
Y = AX^2 + BX + C that will approximately fit the data. Choose X = 0
at 1920. Extrapolate for 1937.

15. The following table gives the annual production of cigarettes (billions)
in the United States in the given years. Use semi-logarithmic paper
to find the X, log Y and the X, Y equations. Choose X = 0 at 1920, and
let Y = Production. Find the extrapolated value for 1930. The actual
value for 1930 was 123.8 billions.

Year    Annual Production    Year    Annual Production
        (billions)                   (billions)
1920    47.4                 1925     82.2
1921    52.1                 1926     92.1
1922    55.8                 1927     99.8
1923    66.7                 1928    108.7
1924    72.7                 1929    122.3

16. Find the trend line for the changing price of beef as described in the
data of Table 12 (p. 47).

17.  In  the  following  table  the  unit  is  1,000,000  barrels  of  42  gallons. 


Annual Production of Petroleum in the
United States, 1900-1913

Year    Production    Year    Production    Year    Production
1900     63.6         1905    134.7         1910    209.6
1901     69.4         1906    126.5         1911    220.4
1902     88.8         1907    166.1         1912    222.9
1903    100.5         1908    178.5         1913    248.4
1904    117.1         1909    183.2

Find  the  equation  of  the  trend  line,  the  computed  values  of  the  produc- 
tion for  the  given  years,  and  the  residuals.  Find  the  predicted  values  for 
the  years  1915  and  1920  and  compare  your  results  with  those  given  in 
Commerce  Yearbook,  1930,  page  293,  which  are  as  follows:  1915,  production, 
281.1;  1920,  production,  442.9.  What  can  you  say  for  the  trend  line  for 
purposes  of  prediction? 

18.  Find  the  equation  of  the  trend  line  (a)  excluding  the  years  1916, 
1917,  and  1918,  and  (b)  including  these  years.  Find  the  computed  produc- 
tion and  the  residuals  in  each  case. 


Average Monthly Production of Pig Iron in the
United States, 1903-1918 ¹

Year    Production          Year    Production          Year    Production
        (1000 long tons)            (1000 long tons)            (1000 long tons)
1903    1,452               1909    2,116               1914    1,921
1904    1,344               1910    2,237               1915    2,472
1905    1,882               1911    1,944               1916    3,252
1906    2,066               1912    2,448               1917    3,182
1907    2,109               1913    2,560               1918    3,209
1908    1,302

19.  Find  the  equation  of  the  trend  line,  the  computed  values  of  the 
production,  and  the  residuals. 

^  The  data  are  taken  from  Review  of  Economic  Statistics,  Vol.  I,  p.  66;  United 
States  Department  of  Commerce,  Survey  of  Current  BuMness,  No.  42,  p.  44. 
Total Production of Crude Steel, 1900-1929¹
(Production in millions of long tons)

Year   Production     Year   Production     Year   Production     Year   Production
1900      10.6        1908      14.0        1916      42.8        1923      44.9
1901      13.5        1909      24.0        1917      45.1        1924      37.9
1902      14.9        1910      26.1        1918      44.5        1925      45.4
1903      13.9        1911      23.7        1919      34.7        1926      48.3
1904      13.9        1912      31.3        1920      42.1        1927      44.9
1905      20.0        1913      31.3        1921      19.8        1928      51.5
1906      23.4        1914      23.5        1922      35.6        1929      56.4
1907      23.4        1915      32.2

88.  GOODNESS  OF  FIT  OF  CURVES  TO  OBSERVED 
DATA:  NONLINEAR  CORRELATION 

A. Goodness of Fit. The investigator who takes the time to
derive an empirical formula for a set of observed data is naturally
interested in knowing how well the curve fits the observations. He
will therefore always find the values of the dependent variable
computed by his formula, and usually the Y-residuals if Y is the
dependent variable.

Any Y-residual, it will be recalled, is given by $\rho_i$, where

$$\rho_i = \text{the observed } Y_i - \text{the computed } Y_i$$

The variation in the residuals may be measured by their mean
deviation or by their standard error, that is, by

$$\text{M.D. of } \rho = \frac{\Sigma |\rho_i|}{n}$$

or by

$$S_y = \sqrt{\frac{\Sigma \rho_i^2}{n}}$$

If the constants have been found by the method of selected points
or by the method of averages, the mean deviation is adequate, but
if the constants have been determined by the method of least squares,
$S_y$ is the natural measure. In either case the results will be expressed
in the given Y unit.

¹ The data are taken from Statistical Abstract of the United States, 1918, p. 251;
ibid., 1930, p. 756.

In Section 63 (p. 238), while evaluating $S_y$ for the line $Y = mX + b$,
which had been fitted by least squares, we found

$$S_y = \sigma_y \sqrt{1 - r^2}$$

where $r = \dfrac{\Sigma xy}{n\sigma_x\sigma_y}$ is the cross-product formula for measuring linear
correlation. We also found r to be an excellent measure of the goodness
of fit of the points to the derived line. If the formula above is
solved for r, we have

$$\text{correlation based upon the straight line} = r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}} \tag{20}$$

where $S_y$ is the standard error of estimate based upon the straight
line.
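
Relation (20) can be checked numerically. The following Python sketch is not part of the original text; it fits a least-squares line to a small set of made-up observations and compares r from the cross-product formula with the square root of 1 - S_y²/σ_y². The function name `line_fit_measures` and the data are assumptions chosen only for illustration, and all standard deviations divide by n, as in the definition of S_y.

```python
import math

def line_fit_measures(xs, ys):
    """Fit Y = mX + b by least squares and return (r, S_y, sigma_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]            # deviations of X from its mean
    dy = [y - mean_y for y in ys]            # deviations of Y from its mean
    sum_xx = sum(d * d for d in dx)
    sum_yy = sum(d * d for d in dy)
    sum_xy = sum(p * q for p, q in zip(dx, dy))
    sigma_x = math.sqrt(sum_xx / n)
    sigma_y = math.sqrt(sum_yy / n)
    r = sum_xy / (n * sigma_x * sigma_y)     # cross-product formula
    m = sum_xy / sum_xx                      # least-squares slope
    b = mean_y - m * mean_x                  # least-squares intercept
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    s_y = math.sqrt(sum(e * e for e in residuals) / n)
    return r, s_y, sigma_y

# Hypothetical observations, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
r, s_y, sigma_y = line_fit_measures(xs, ys)
from_residuals = math.sqrt(1 - s_y**2 / sigma_y**2)
print(round(r, 6), round(from_residuals, 6))
```

The two printed numbers should agree up to rounding, and the second is never negative, since it is obtained from a square root.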

B. Nonlinear Correlation. The process of arriving at a coefficient
of correlation based upon curvilinear regression is comparatively
simple in principle but often becomes very complex in practice. To
emphasize the evident simplicity of the process let us proceed exactly
as we did in Section 63 and find a coefficient of correlation based
upon the parabola

$$y = ax^2$$

where x and y are deviations of X and Y from their respective means,
$M_x$ and $M_y$. Since any y-residual is given by

$$\rho_i = y_i - ax_i^2$$

we have

$$\Sigma\rho_i^2 = a^2\Sigma x_i^4 - 2a\Sigma x_i^2 y_i + \Sigma y_i^2$$

which is a quadratic in a. Now $\Sigma\rho^2$ is a minimum when

$$a = \frac{-(-2\Sigma x^2 y)}{2\Sigma x^4} = \frac{\Sigma x^2 y}{\Sigma x^4}$$

Hence the best-fitting curve is given by

$$y = \frac{\Sigma x^2 y}{\Sigma x^4}\,x^2$$

where a is computed, of course, from the observed values.
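For this value of a, the sum of the squares of the residuals becomes:

$$\Sigma\rho^2 = \Sigma y^2 - \frac{(\Sigma x^2 y)^2}{\Sigma x^4}$$

and $S_y^2 = \dfrac{\Sigma\rho^2}{n}$ becomes:

$$S_y^2 = \sigma_y^2\left[1 - \frac{(\Sigma x^2 y)^2}{n\sigma_y^2\,\Sigma x^4}\right]$$

Now evidently a coefficient of correlation based upon the parabola
$y = ax^2$ is the expression:

$$\frac{\Sigma x^2 y}{\sigma_y\sqrt{n\,\Sigma x^4}}$$

Note that in this case

$$\text{the coefficient of correlation based on the parabola} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

where $S_y$ is the standard error of estimate for the parabola.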
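
The parabola result can be checked on numbers. The Python sketch below is an illustration, not the authors' own computation. It forms the deviations x and y, computes a = Σx²y / Σx⁴, and then evaluates the correlation two ways: from the closed form Σx²y / (σ_y √(n Σx⁴)), which follows on substituting a into Σρ², and from √(1 - S_y²/σ_y²). The function name `parabola_correlation` and the data are hypothetical.

```python
import math

def parabola_correlation(X, Y):
    """Fit y = a*x**2 to deviations from the means.

    Returns (a, closed_form, from_residuals), where the last two are the
    correlation computed from the closed form and from S_y and sigma_y.
    """
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(Y) / n
    x = [v - mean_x for v in X]              # deviations of X
    y = [v - mean_y for v in Y]              # deviations of Y
    sum_x4 = sum(xi**4 for xi in x)
    sum_x2y = sum(xi**2 * yi for xi, yi in zip(x, y))
    sum_y2 = sum(yi**2 for yi in y)
    a = sum_x2y / sum_x4                     # value of a minimizing the residual sum
    sigma_y = math.sqrt(sum_y2 / n)
    s_y_sq = sum((yi - a * xi**2)**2 for xi, yi in zip(x, y)) / n
    closed_form = sum_x2y / (sigma_y * math.sqrt(n * sum_x4))
    from_residuals = math.sqrt(1 - s_y_sq / sigma_y**2)
    return a, closed_form, from_residuals

# Hypothetical observations, for illustration only.
a, closed_form, from_residuals = parabola_correlation(
    [1, 2, 3, 4, 5], [2.0, 1.1, 0.9, 1.8, 3.9])
print(a, closed_form, from_residuals)
```

The closed form carries the sign of Σx²y, whereas the square-root form is never negative, so the two agree in absolute value.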

In order to emphasize that this simple method will not always
work, let the student undertake its application to the curves:

$$y = x^a \quad \text{and} \quad y = a^x$$

He will soon discover that "the method is simple in principle but very
complex in practice."

However, for any curve which can be fitted to observed data we
can always find $S_y$ by the definition:

$$S_y = \sqrt{\frac{\Sigma\,[\,Y\ \text{observed} - Y\ \text{computed}\,]^2}{\text{the number of observations}}}$$

and we can find $\sigma_y$ by the methods of Chapter 4. We can therefore
define as a measure of correlation based upon any such curve the
function¹

$$\text{measure of correlation based upon any curve} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

where $S_y$ is the standard error of estimate for the curve, and $\sigma_y$
is the standard deviation of the given Y measures. This measure
of correlation has been called the index of correlation, and is denoted
by:

$$\rho_{XY}$$

¹ For a test of linearity of regression, see Rietz and others, op. cit., p. 131.




The limits of $\rho_{XY}$ are 0 and 1, a value of 0 indicating no relationship
based upon the given function and a value of 1 denoting perfect
relationship. In general:

$$0 \le \rho_{XY} \le 1$$

No positive or negative sign should be attached to $\rho_{XY}$, for the
relationship might be positive over part of the range and negative over
other parts.¹

If the given curve is a straight line, then

$$\rho_{XY} = |r|$$

It seems hardly necessary to state that if correlation is measured by
$\rho_{XY}$, the curve to which it applies should always be stated. In the
case of r no statement is necessary, for it is generally understood
that r is based upon linear regression.
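
The index of correlation needs only the observed values and the values computed from the fitted curve, so it is easily put into a short routine. The Python sketch below is offered as an illustration under stated assumptions: the name `index_of_correlation` is hypothetical, both S_y and σ_y divide by the number of observations, and a curve that fits worse than the mean of Y (so that S_y exceeds σ_y) is reported as 0, a choice made here and not taken from the text.

```python
import math

def index_of_correlation(observed, computed):
    """Return sqrt(1 - S_y**2 / sigma_y**2) for any fitted curve.

    observed: the given Y measures
    computed: the Y values given by the fitted curve at the same X values
    """
    n = len(observed)
    mean_y = sum(observed) / n
    sigma_y_sq = sum((y - mean_y)**2 for y in observed) / n
    s_y_sq = sum((o - c)**2 for o, c in zip(observed, computed)) / n
    # Assumption: clamp at 0 when the curve fits worse than the mean of Y.
    return math.sqrt(max(0.0, 1.0 - s_y_sq / sigma_y_sq))

# A curve passing through every point gives 1; a curve that always
# predicts the mean of Y gives 0.
print(index_of_correlation([1, 2, 3], [1, 2, 3]))
print(index_of_correlation([1, 2, 3], [2, 2, 2]))
```

As the text requires, the curve to which the index applies should be stated whenever a value computed this way is reported.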

EXERCISES 

1. Fit a straight line to the data of the following table.

Patients in New York State Hospitals for the Insane, 1910-1931²

        Number of Patients                Number of Patients
Year    per 1,000,000 Population   Year   per 1,000,000 Population
1910            35.6               1922           40.2
1913            36.9               1925           41.6
1916            38.1               1928           43.3
1919            38.8               1931           45.0

2.  In  a  certain  gas-pressure  experiment  the  following  results,  in  which 
V  is  the  volume  corresponding  to  the  pressure  p,  were  obtained.  Fit  an 
appropriate  curve  to  the  data. 


p    33.13    40.44    50.48    59.30    67.08    74.36
V    12.20     9.45     7.55     6.47     5.65     5.07

3. The following table gives the number of divorces per 1,000 marriages
during the given years. Fit a curve of the type $Y = aX^2 + bX + c$ to
the data. (Choose X = 0 at 1910.)

¹ F. C. Mills, Statistical Methods, Revised, p. 408.
² The data are from World Almanac, 1932, p. 534.
Divorces in the United States, 1890-1930¹

        Number of Divorces           Number of Divorces
Year    per 1,000 Marriages   Year   per 1,000 Marriages
1890            62            1915          104
1895            67            1920          134
1900            81            1925          148
1905            84            1930          170
1910            88

4. The following table gives the number of divorces per 1,000 population
during the given years. Fit a curve of the type $Y = ab^X$ to these
data. What are the computed values for the years 1915 and 1928? The
actual values were 1.05 and 1.66.

Divorces in the United States, 1870-1930²

        Number of Divorces            Number of Divorces
Year    per 1,000 Population   Year   per 1,000 Population
1870           0.28            1910          0.90
1880           0.39            1920          1.60
1890           0.53            1930          1.56
1900           0.73

5. The following table gives the number of grams S of anhydrous
ammonium chloride which, dissolved in 100 grams of water, makes a saturated
solution at θ° absolute temperature. Fit an appropriate curve to the data.

θ    273     283     288     293     313     333     353     373
S    29.4    33.3    35.2    37.2    45.8    55.2    65.6    77.3

¹ The data are from Statistical Abstract of the United States, 1932, p. 87.
² The data are from World Almanac, 1932, p. 444.

6. The velocity of water in feet per second in the Mississippi river
was measured at various depths, and the ratios, D, of the measured
depth to the depth of the river were computed. Fit a curve of the type
$V = aD^2 + bD + c$. Find the computed V when D = 0.9. The observed
value was V = 2.9759.

D    0        0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8
V    3.1950   3.2299   3.2532   3.2611   3.2516   3.2282   3.1807   3.1266   3.0594

7. The following table gives the temperature θ of a vessel of cooling
water at the end of t minutes. Show that the data may be appropriately
fitted to $\theta = c + ab^t$. Find the values of a, b, and c and the computed
values of θ.

t    0      1      2      3      5      7      10     15     20
θ    92.0   85.3   79.5   74.5   67.0   60.5   53.5   45.0   39.5

8.  For  the  data  of  the  following  table  find  the  exponential  curve  which 
appropriately  describes  the  trend.  Find  the  amount  in  force  in  1930 
computed  by  the  trend  and  compare  the  result  with  107.9,  which  was  the 
actual  value. 

Life Insurance in Force in the United States, 1880-1928¹

        Total Amount           Total Amount
Year     (billions)     Year    (billions)
1880        1.6         1915       22.8
1890        4.0         1920       42.3
1900        8.6         1925       71.7
1905       13.4         1928       95.2
1910       16.4         1930       . . .

9. The indicated horse-power, I, required to drive a ship of displacement
D tons at a ten-knot speed is given by the following data. Justify
the use of the curve $I = aD^b$. Fit this curve to the data.

D    1,720    2,300    3,200    4,100
I      655      789    1,000    1,164

¹ The data are from Statistical Abstract of the United States, 1932, p. 283.

10. For the data of the following table fit a parabola $Y = aX^3 + bX^2 + cX + d$.
(Choose X = 0 at 1920.) Use the derived formula to predict
the number of failures in 1931, and compare with the actual number, 28.3.

Commercial Failures in the United States, 1910-1930¹

        Number of Failures           Number of Failures
Year       (thousands)        Year      (thousands)
1910          12.6            1921         19.7
1911          13.4            1922         23.7
1912          15.5            1923         18.7
1913          16.0            1924         20.6
1914          18.3            1925         21.2
1915          22.2            1926         21.8
1916          17.0            1927         23.1
1917          13.9            1928         23.8
1918          10.0            1929         22.9
1919           6.5            1930         26.4
1920           8.9            1931         . . .

11. Using the method of Section 63 (p. 237), show that a coefficient of
correlation based upon the parabola $Y = a\sqrt{X}$ is

12. Show that a coefficient of correlation based upon the equilateral
hyperbola $xy = a$ is


13. Find the correlation coefficient based upon $xy = a$ for the data of
Table 55 (p. 242). Compare your result with the correlation based upon
linear regression.

14.  Fit  an  appropriate  curve  to  the  data  of  Exercise  18  (p.  106). 

15.  What  law  will  satisfactorily  represent  the  following  data?  Find  the 
values  of  the  constants  for  the  curve  selected. 


 X       y          X       y
 2     12.83        8     19.95
 3     13.48        9     22.31
 4     14.28       10     25.24
 5     15.28       11     28.87
 6     16.52       12     33.37
 7     18.05       13     38.44

¹ The data are from Statistical Abstract of the United States, 1932, p. 295.
16. Show that if $Y = aX^2 + bX + c$ is selected to represent a mass
of observed data, the equations for the determination of the constants by
the method of moments (see Section 69B, p. 219) are those given by (10).

17. The curve $Y = c + aX^b$ passes through the points (2, 11.5), (4,
18.8), and (8, 39.7). Determine a, b, c. Find Y when X = 5.

18. The curve $Y = c + ab^X$ passes through the points (2, 5.3), (4, 12.8),
and (6, 30.2). Determine a, b, c. Find Y when X = 3.

19. Does the point $(M_x, M_y)$ lie on the parabola $Y = AX^2 + BX + C$
if it is fitted by least squares?


Chapter 11


PERMUTATIONS,  COMBINATIONS,  AND 

PROBABILITY 

89.  INTRODUCTION 

In  Section  2  of  this  text  we  indicated  that  the  solution  of  a  general 
statistical  problem  may  be  divided  into  four  parts:  (1)  the  collection 
of the data, (2) its organization, (3) its analysis, and (4) the interpretation
of the results of the analysis. The earlier chapters have
been  devoted  primarily  to  the  steps  of  organization  and  analysis. 
Given  masses  of  numerical  data,  we  have  learned  to  present  them  in 
suitable  tabular  form,  to  represent  them  with  graphic  devices  which 
emphasize  some  of  the  significant  features,  and  to  effect  numerical 
analyses  the  results  of  which  —  when  properly  interpreted  —  present 
numerical  descriptions  of  the  groups. 

In  our  previous  discussion  we  have  analyzed  a  large  number  of 
frequency  distributions  that  were  derived  from  several  fields:  biology, 
education,  sociology,  economics,  psychology,  engineering.  Each 
distribution  has  presented  a  specific  problem  and  has  been  analyzed 
as  a  specific  problem.  We  have  thus  far  made  but  little  attempt  at 
generalization.  Our  method  has  been  the  method  of  science:  ob- 
servation, classification,  analysis.  We  now  approach  the  final  step, 
generalization. 

In order to extend our method beyond the analysis of a specific
group of data, we are now about to enter upon a study of problems
that are rather theoretical. It must not be assumed that because
the problems are theoretical they are impractical. We shall find
that they are decidedly practical. The first theoretical problem to
which we shall give attention will be the development of some
general laws to describe frequency distributions, the point binomial
and the normal curve, that are usually spoken of as laws of chance.
We shall then be in a position to compare theory with observation
and to determine whether the differences between theory and observation
are such as may be accounted for by causes other than
chance.

The  reader  has  doubtless  noted  that  statistical  measurements  when 
gathered  in  fairly  large  numbers,  although  possessing  considerable 
variation,  show  a  quality  of  orderliness  that  is  at  times  amazing.  As 
we  pass  along  the  scale  of  measurement  of  a  variable,  from  the 
smallest  magnitude  to  the  largest,  we  find  orderliness  in  the  change  in 
the  frequency.  Most  commonly  the  frequency,  relatively  small  at 
the  lower  end  of  the  range,  increases  regularly  until  a  maximum  is 
reached  in  the  central  portion  of  the  range  then  diminishes  regularly 
toward  zero  at  the  upper  end  of  the  range. 

This  behavior  in  variation  in  observed  phenomena  was  first  ap- 
preciated by  the  mathematical  astronomer,  Pierre  Simon  Laplace, 
(1749-1827)  to  the  degree  that  he  expressed  the  behavior  by  a 
mathematical  function  known  as  the  normal  law.  The  law  had  been 
previously  discovered  by  the  mathematician,  Abraham  de  Moivre, 
(1667-1754)  in  1733  as  an  adventure  in  pure  mathematics  to  explain 
the  probabilities  of  games  of  chance.  Carl  Friedrich  Gauss  (1777- 
1855)  made  use  of  it  and  thus  gave  it  the  approval  of  a  very  great 
mathematician.  The  application  of  this  function  to  biological  vari- 
ations was  soon  appreciated  by  the  Belgian  scientist,  Adolphe  Quetelet 
(1796-1874).  The  normal  law  has  thus  become  a  foundation  stone 
in  the  modern  statistical  structure.  That  it  would  someday  be  used 
in  the  solution  of  biological,  social,  and  economic  problems  and  be 
invoked  in  countless  investigations  of  the  sciences  was  of  course 
never  dreamed  or  imagined  by  its  discoverer. 

The  second  theoretical  problem,  one  to  which  we  have  alluded 
several  times  in  the  text  and  to  which  we  shall  devote  further  atten- 
tion, is  what  may  be  termed  the  problem  of  sampling.  We  have  seen 
that  we  may  describe  a  mass  of  quantitative  data  as  precisely  as 
we  please  by  computing  for  the  data  certain  statistical  constants. 
These  constants  give  a  condensed  description  of  the  group  in  terms 
of the group's characteristics. Among the tremendous gains realized
by  this  summary,  not  the  least  important  is  this:  the  summary 
makes  possible  the  comparison  of  the  characteristics  of  the  individual 
with  the  characteristics  of  the  group  of  which  he  is  a  part. 

This  group  that  is  measured  and  analyzed  is  usually  a  sample, 
a  small  part  of  a  larger  universe  or  parent  population  that  is  impossible 
or impracticable to measure. Generally, we desire to use the results
of  the  study  of  the  sample  to  make  estimates  of  the  constants  that 
statistically  describe  the  universe.  This  process  is  called  statistical 
inference  or  statistical  induction.  It  is  the  problem  of  inferring  the 
characteristics of the universe from the characteristics of the sample,
and measuring the reliability of the inferences. This problem may
be stated as a question: To what extent is M, σ, or any other constant
computed from a sample of N observations randomly made
from a universe trustworthy as the mean, standard deviation, or
other value of the universe? The answer to this question constitutes
what we term the interpretation of statistical results.¹

Statistical  induction  is  literally  permeated  with  questions  that  re- 
late to  the  theory  of  probability,  and  in  order  to  understand  enough 
of  this  science  to  appreciate  its  widespread  applications  we  shall 
now  introduce  the  student  to  the  simplest  ideas  of  the  theory. 

In  the  present  chapter  we  shall  consider  certain  elementary  notions 
of  probability.  These  notions  we  shall  approach  along  the  avenue 
of  permutations  and  combinations.  We  shall  undertake  to  give  a 
thorough  and  much  needed  drill  in  a  number  of  important  algebraic 
concepts  which  will  find  repeated  application  in  the  chapters  that 
follow.  Permutations  and  combinations  will  lead  us  to  the  point 
binomial,  which  in  turn  will  serve  to  introduce  us  to  the  normal 
probability  curve.   Thus  we  start  with  the  notion  of  a  permutation. 

90.  PERMUTATIONS 

A  permutation  is  an  order  or  an  arrangement  of  all  or  a  part  of  a 
number  of  things. 

Thus, the permutations of the three letters a, b, c, taken all at a
time are: a b c, a c b, b a c, b c a, c a b, c b a.

It is seen that 3 objects can be arranged linearly in 3 · 2 = 6 different
ways. We might reason in the following manner. There are
3 places to be filled. The first place can be filled in 3 ways, and
with each of these the second place can be filled in 2 ways. Hence
the 2 places can be filled in 6 ways. With each of these 6 ways of
filling the first 2 places there is 1 way of filling the last place, hence
3 · 2 · 1 ways in all.

¹ So important is this aspect of our study that some writers devote practically
their entire treatments to it. For examples, see the texts by R. A. Fisher and
by Alan E. Treloar which are listed in Appendix A.

This  example  illustrates  the  following: 

Fundamental Principle. If one thing can be done in m ways, and
if, after this is done in one of these ways, a second thing can be done
in n ways, then the two together can be done in mn ways.

The foregoing principle may be extended into the

Theorem. If one thing can be done in $m_1$ ways, a second in $m_2$
ways, a third in $m_3$ ways, and so on, the number of different ways in
which they can be done when taken all together in the order stated is

$$m_1 m_2 m_3 \cdots$$

Example 1. How many three-digit numbers can be formed from the
digits 1, 2, 3, 4, 5 if each digit is used only once?

The first place can be filled in 5 ways, and after that is done the second
place can be filled in 4 ways, and then the third place in 3 ways. Hence,
we can form 5 · 4 · 3 = 60 different numbers of the specified kind.

Example  2.  How  many  three-digit  numbers  can  be  formed  from  the 
digits  1,  2,  3,  4,  5  if  each  digit  can  be  repeated? 

The  first  place  can  be  filled  in  5  ways,  and  after  that  is  done  the  second 
place  can  be  filled  in  5  ways,  and  then  the  third  place  in  5  ways.  Hence, 
we  can  form  5  •  5  •  5  =  125  different  numbers  of  the  specified  kind. 

Example  3.  How  many  three-digit  even  numbers  can  be  formed  from 
the  digits  1,  2,  3,  4,  5  if  each  digit  is  used  only  once? 

The unit's place can be filled in two ways (with either the 2 or the 4). The
ten's place can then be filled in 4 ways and the hundred's place in 3 ways.
In all, there are 3 · 4 · 2 = 24 numbers of the specified kind.

Example  4.  In  an  introductory  course  in  statistical  analysis  there  are 
four  lecture  sections,  A,  B,  C,  D,  and  three  laboratory  sections,  X,  Y,  Z. 
In  how  many  ways  may  a  student  choose  a  section  in  each? 

He  may  choose  the  lecture  section  in  4  ways  and  the  laboratory  section 
in  3  ways.   He  may  choose  both  in  4  •  3  =  12  ways. 
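
The counts in Examples 1 to 4 are small enough to confirm by listing every case. The Python sketch below is an illustrative check and not part of the original text; the function names are hypothetical. It uses the standard-library `itertools.permutations` (ordered choices without repetition) and `itertools.product` (ordered choices with repetition, or one choice from each of several sets).

```python
from itertools import permutations, product

DIGITS = [1, 2, 3, 4, 5]

def count_no_repeat():
    """Example 1: three-digit numbers, each digit used only once."""
    return sum(1 for _ in permutations(DIGITS, 3))

def count_with_repeat():
    """Example 2: three-digit numbers, digits may be repeated."""
    return sum(1 for _ in product(DIGITS, repeat=3))

def count_even_no_repeat():
    """Example 3: even three-digit numbers, each digit used only once."""
    return sum(1 for p in permutations(DIGITS, 3) if p[-1] % 2 == 0)

def count_section_choices():
    """Example 4: one lecture section and one laboratory section."""
    return sum(1 for _ in product("ABCD", "XYZ"))

print(count_no_repeat(), count_with_repeat(),
      count_even_no_repeat(), count_section_choices())
# prints: 60 125 24 12
```

Listing every case is practical only for small problems; the fundamental principle gives the same counts by multiplication alone.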

Question.  In  an  election  there  are  three  candidates  for  mayor  and 
four  candidates  for  treasurer.  In  how  many  ways  can  a  ballot  be  marked 
for  both  of  these  offices? 
2. If 3 coins are tossed, in how many ways can they fall?

3. If 2 dice are thrown, in how many ways can they fall?

4. If 2 dice and 3 coins are tossed, in how many ways can they fall?

5. How many signals can be made by hoisting 3 flags if there are 9
different flags from which to choose?

6. In how many different ways can 3 positions be filled by selections
from 15 different people?

7. How many four-digit numbers can be formed from the numbers 1, 2,
3, 4, 5, 6, 7, 8, 9?

91.  NUMBER  OF  PERMUTATIONS 

In the preceding section we wrote the permutations of the three
letters a, b, c, taken all at a time. We may also write the permutations
of the same three letters taken two at a time. They are
ab, ac, ba, bc, ca, cb.

Now let us consider the general problem: the number of permutations
of n things taken r at a time (r ≤ n). The number of permutations
of n things taken r at a time is denoted by ${}_nP_r$ and is given
by the formula:¹

$${}_nP_r = n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!} \tag{1}$$

There are r places to fill and n things from which to choose. The
first place may be filled in n ways, the second in (n − 1) ways,
the third place in (n − 2) ways, and so on. The rth place may be
filled in (n − r + 1) ways. Applying the theorem of Section 90,
we immediately have (1).

If all n things are taken n at a time, r = n, and we have:

$${}_nP_n = n(n-1)(n-2)\cdots 3\cdot 2\cdot 1 = n! \tag{2}$$

Since ${}_nP_n = n! = \dfrac{n!}{(n-n)!} = \dfrac{n!}{0!}$, the use of the second form of (1)
when r = n requires that we define 0! to equal unity.

It frequently happens that some restrictions are imposed upon the
number of permutations we are seeking. Whenever any restriction
exists, it is important to consider the restricted groups first. The
method is illustrated by the following:

Example. How many six-place numbers can be formed from the digits
1, 2, 3, 4, 5, 6, if 3 and 4 are always to occupy the middle two places?

The two digits, 3 and 4, can be arranged in 2! ways. The other four digits
can be arranged in 4! ways. Hence in all 2! · 4! = 48 numbers.
¹ n! = 1 · 2 · 3 ⋯ n is called factorial n.
EXERCISES

1.  How  many  different  numbers  less  than  1,000  can  be  formed  from  the 
digits  1,  2,  3,  4,  5,  6? 

2.  Five  persons  enter  a  car  in  which  8  seats  are  vacant.  In  how  many 
ways  can  they  be  seated? 

3.  In  how  many  ways  can  10  boys  stand  in  a  row  when: 

(a)  a  given  boy  is  at  a  given  end?  (b)  a  given  boy  is  at  an  end? 
(c)  two  given  boys  are  always  together?  (d)  two  given  boys  are  never 
together? 

4.  In  how  many  ways  can  3  different  algebras  and  4  different  geometries 
be  arranged  on  a  shelf  so  that  the  algebras  are  always  together? 

5. Find the number of permutations, P, of the letters a a b b b taken 5
at a time. Hint: P · 2! · 3! = 5!

6. If P represents the number of distinct permutations of n things, taken
all at a time, when, of the n things, there are $n_1$ alike, $n_2$ others alike, $n_3$
others alike, etc., then:

$$P = \frac{n!}{n_1!\,n_2!\,n_3!\cdots}$$

7.  How  many  distinct  permutations  can  be  made  of  the  letters  of  the 
word  attention  taken  all  at  a  time? 

8.  How  many  distinct  permutations  of  the  letters  of  the  word  Mississippi 
can  be  formed  taking  the  letters  all  at  a  time? 

9. In how many ways can ten balls be arranged in a line if 3 are white, 5
are red, and 2 are blue?

92.  COMBINATIONS 

A  group  of  things  or  elements  without  reference  to  the  order  of 
the  individuals  in  the  group  is  called  a  combination. 

Thus, the combinations of a b c d taken 3 at a time are a b c, a b d,
a c d, b c d. From each combination we can form 3! different permutations,
and hence from the 4 combinations we can form (3!) · 4 = 24
permutations of 4 letters 3 at a time.

A combination is frequently called a selection, whereas a permutation
is an arrangement.

The number of combinations of n things taken r at a time is denoted
by ${}_nC_r$, and is given by the formula:

$${}_nC_r = \frac{{}_nP_r}{r!} \tag{3}$$

For r! permutations can be formed from each combination of r
elements; and hence the total number of permutations must be r!
times the number of combinations, ${}_nC_r$. That is, $r!\cdot{}_nC_r = {}_nP_r$, from
which (3) immediately follows.
By applying (1):

$${}_nC_r = \frac{n(n-1)(n-2)\cdots(n-r+1)}{r!} = \frac{n!}{r!\,(n-r)!} \tag{4}$$

From (4) it follows immediately that:

$${}_nC_r = {}_nC_{n-r} \tag{5}$$

The binomial theorem, which is usually written in the form

$$(a+b)^n = a^n + na^{n-1}b + \frac{n(n-1)}{2!}a^{n-2}b^2 + \cdots + b^n$$

may be conveniently written

$$(a+b)^n = a^n + {}_nC_1a^{n-1}b + {}_nC_2a^{n-2}b^2 + \cdots + {}_nC_ra^{n-r}b^r + \cdots + b^n \tag{6}$$

$$= \sum_{r=0}^{n} {}_nC_r\,a^{n-r}b^r \tag{7}$$

if we define ${}_nC_0$ to be 1.
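
Formulas (3), (5), and (7) lend themselves to a quick numerical check. The Python sketch below is illustrative only; the names `n_c_r` and `binomial_sum` are hypothetical, while `math.perm`, `math.comb`, and `math.factorial` are standard-library functions (Python 3.8 and later).

```python
import math

def n_c_r(n, r):
    """Formula (3): the number of permutations divided by r!."""
    return math.perm(n, r) // math.factorial(r)

def binomial_sum(a, b, n):
    """Right-hand side of (7): sum of nCr * a**(n-r) * b**r for r = 0..n."""
    return sum(math.comb(n, r) * a**(n - r) * b**r for r in range(n + 1))

# (3) agrees with the library value, and (5) holds: nCr = nC(n-r).
print(n_c_r(12, 9), math.comb(12, 9), n_c_r(12, 3))
# (7) reproduces (a + b)**n for integer a, b, n.
print(binomial_sum(2, 3, 5), (2 + 3)**5)
```

Integer arithmetic is used throughout, so the comparisons are exact and free of rounding error.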

We  shall  now  illustrate  these  remarks  with  a  few  examples. 

Example 1. In how many ways can a committee of 9 be selected from
12 people?

This is evidently a problem of selection, not of arrangement, and the
result is evidently:

$${}_{12}C_9 = {}_{12}C_3 = \frac{12\cdot 11\cdot 10}{1\cdot 2\cdot 3} = 220$$

Example 2. From 6 men and 5 women, in how many ways can we select
a group of 4 men and 3 women?

a. We can select the 4 men from 6 men in ${}_6C_4$ ways.

b. We can select the 3 women from 5 women in ${}_5C_3$ ways.

By the fundamental principle we can do a. and b. in ${}_6C_4\cdot{}_5C_3 = 150$
ways.

Example 3. From 6 men and 5 women, how many committees of 8 each
can be formed when the committee contains at least 3 women?

The conditions of the problem are satisfied if the committee contains:

a. 5 men and 3 women

b. 4 men and 4 women

c. 3 men and 5 women

Therefore the number of possible committees is

$${}_6C_5\cdot{}_5C_3 + {}_6C_4\cdot{}_5C_4 + {}_6C_3\cdot{}_5C_5 = 60 + 75 + 20 = 155$$

It frequently happens that the problem involves both a selection and an
arrangement with a limitation upon either. In such problems it is best to
consider the two steps separately. A safe procedure is to deal first with the
question of the selections (combinations) and then with the arrangements
(permutations).

Example 4. How many line-ups are possible in choosing a baseball nine
of 5 seniors and 4 juniors from a squad of 8 seniors and 7 juniors, if any
man can be used in any position?

The 5 seniors can be selected in ${}_8C_5$ ways, the 4 juniors in ${}_7C_4$ ways.
Hence the set of players can be selected in ${}_8C_5\cdot{}_7C_4$ ways.

Any one set of 9 men can be arranged in 9! ways. Hence the total number
of possible line-ups is ${}_8C_5\cdot{}_7C_4\cdot 9!$.
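
The four examples can be verified by machine. In the Python sketch below, which is an illustration and not part of the original text, Examples 1, 2, and 4 are evaluated with the standard-library `math.comb`, and Example 3 is checked independently by listing every committee of 8 and keeping those with at least 3 women. The function names and the labels "M" and "W" are hypothetical.

```python
import math
from itertools import combinations

def committees_with_at_least_3_women():
    """Example 3 by direct listing: committees of 8 from 6 men and 5 women."""
    people = [("M", i) for i in range(6)] + [("W", i) for i in range(5)]
    return sum(1 for committee in combinations(people, 8)
               if sum(1 for sex, _ in committee if sex == "W") >= 3)

def lineups():
    """Example 4: select 5 of 8 seniors and 4 of 7 juniors, then arrange all 9."""
    return math.comb(8, 5) * math.comb(7, 4) * math.factorial(9)

print(math.comb(12, 9))                    # Example 1, prints: 220
print(math.comb(6, 4) * math.comb(5, 3))   # Example 2, prints: 150
print(committees_with_at_least_3_women())  # Example 3, prints: 155
print(lineups())                           # Example 4, prints: 711244800
```

Example 4 follows the procedure recommended in the text: the selections are counted first and the arrangements second.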

EXERCISES 

1.  Compute  10C2;  loCs;  100C98. 

2.  How  many  squads  of  6  men  each  can  be  selected  from  a  squad  of  60 
men? 

3.  In  how  many  ways  can  a  committee  of  3  teachers  and  2  students  be 
selected  from  8  teachers  and  15  students? 

4.  How  many  straight  lines  are  determined  from  10  points,  no  3  of  which 
are  in  the  same  straight  line? 

5.  How  many  different  sums  can  be  made  from  a  cent,  a  nickel,  a  dime, 
a  quarter,  a  half-dollar,  and  a  dollar? 

6.  From  10  books,  in  how  many  ways  can  a  selection  of  6  be  made: 
(a)  when  a  specified  book  is  always  included?  (b)  when  a  specified  book 
is  always  excluded? 

7. Prove that ${}_nC_r + {}_nC_{r-1} = {}_{n+1}C_r$.

8.  Out  of  6  different  consonants  and  4  different  vowels,  how  many  linear 
arrangements  of  letters,  each  containing  4  consonants  and  3  vowels,  can  be 
formed? 

9.  A  lodge  has  50  members  of  whom  6  are  physicians.  In  how  many 
ways  can  a  committee  of  10  be  chosen  so  as  to  contain  at  least  3  physicians? 

10. In equation (6) make a = b = 1, and show that

$${}_nC_1 + {}_nC_2 + \cdots + {}_nC_n = 2^n - 1$$

11.  Solve  Exercise  5  above,  using  Exercise  10. 

12.  In  how  many  ways  can  7  men  stand  in  line  so  that  2  particular  men 
will  not  be  together? 

13.  A  committee  of  7  is  to  be  chosen  from  8  Englishmen  and  5  Americans. 
In  how  many  ways  can  a  committee  be  chosen  if  it  is  to  contain:  (a)  just  4 
Englishmen?  (b)  at  least  4  Englishmen? 

14. Prove: n+2Cr+1 = nCr+1 + 2 · nCr + nCr-1.

15. If nPr = 110 and nCr = 55, find n and r.

16. If nC4 = nC2, find n.

17. If nC3 = 10/21 (nC5), find n.

18. If 2nCn-1 = 91/24 (2n-2Cn), find n.

19. Prove: nC1 + 2 · nC2 + 3 · nC3 + · · · + n · nCn = n(2)^(n-1).

93.  RELATIVE  FREQUENCY:  EMPIRICAL  PROBABILITY 

A  box  contains  2  white  and  3  black  balls  alike  except  in  color.  A 
ball  is  drawn  at  random,  the  color  of  it  is  noted,  and  then  it  is  re- 
placed in  the  box.  The  drawing  of  the  ball  and  replacing  it  is  called 
a  trial.  Suppose  we  make  100  such  drawings,  mixing  the  balls 
thoroughly  after  each  trial,  and  note  that  in  this  sample  of  100 
drawings  we  have  obtained  38  white  and  62  black  balls.  Then  we 
say  38/100  is  the  relative  frequency  of  white  balls  and  62/100  is  the 
relative  frequency  of  black  balls  in  this  set  of  trials.  Suppose  that  this 
experiment  is  repeated  and  that  in  the  next  100  trials  we  obtain 
42  white  balls  and  58  black  balls.  In  the  second  sample  of  100  trials 
the  relative  frequency  of  white  balls  is  42/100  and  that  of  the  black 
balls  is  58/100.  If  the  results  of  the  two  sample  sets  are  combined, 
we  will  then  have  obtained  in  the  200  drawings  80  white  balls  and 
120  black  balls,  and  the  resulting  relative  frequencies  of  white  balls 
and  black  balls  are  80/200  =  2/5  and  120/200  =  3/5  respectively. 
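
The arithmetic of the two samples can be reproduced with exact fractions. This is a small sketch only; the helper name relative_frequency is an assumption made for illustration.

```python
from fractions import Fraction

def relative_frequency(successes, trials):
    """Relative frequency s/n as an exact fraction."""
    return Fraction(successes, trials)

# White balls in the first sample, the second sample, and both combined.
first = relative_frequency(38, 100)
second = relative_frequency(42, 100)
combined = relative_frequency(38 + 42, 100 + 100)

print(first, second, combined)  # 19/50 21/50 2/5
print(1 - combined)             # 3/5, the relative frequency of black balls
```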

In performing experiments of the type described in the preceding
paragraph the happening of the event in question is frequently called
a success, and the nonhappening of the event a failure. In the experiments
described the drawing of a white ball may be counted a success
and that of the black ball a failure. It may be noted that the sum
of the relative frequencies of white balls and black balls in every
sample drawing is equal to unity. In general if we make s + f = n
trials resulting in s successes and f failures we say that:

$$\frac{s}{n} = \text{the relative frequency of the successes}$$

and

$$\frac{f}{n} = \text{the relative frequency of the failures}$$

The sum of the relative frequencies of successes and of failures in
any set of trials is equal to:

$$\frac{s}{n} + \frac{f}{n} = \frac{s + f}{n} = \frac{n}{n} = 1$$

The fraction s/n, which we have called the relative frequency of
successes in n trials, may be considered an approximate probability
derived from observation. If n is large, then, until further knowledge
is obtained, s/n may be taken as a good estimate of the probability
of success in a given trial. Our confidence in this estimate increases
as the number, n, of observed cases increases. If, as n increases
indefinitely, the ratio s/n approaches a limiting value, p, this limiting
value is called the probability of a success in one trial. Hence:

$$p = \lim_{n \to \infty} \frac{s}{n}$$


Thus,  if  we  continue  indefinitely  the  drawing  of  a  ball  from  a  box 
2/5  of  the  contents  of  which  are  white  balls,  we  may  assume  that 
the  relative  frequency  of  white  balls  would  approach  2/5,  and  we  say 
2/5 is the probability of obtaining a white ball in a single trial.

The  probability  that  we  have  thus  far  discussed  as  coming  from 
observation  is  frequently  called  empirical  probability. 

The empirical method of determining probability is widely used
in statistics, pension systems, life insurance, fire insurance, etc. In
using the experimental method we shall simply idealize actual experience
and assume that the limit of s/n exists, and that, if n is
large, s/n is a good estimate of the limit.
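
The idea that s/n settles near a limit can be illustrated by simulation. The sketch below assumes the box of 2 white and 3 black balls used earlier in this section; the function name, the fixed seed, and the trial counts are arbitrary choices for illustration, and a pseudo-random generator stands in for actual drawings.

```python
import random

def simulate_relative_frequency(n_trials, n_white=2, n_black=3, seed=0):
    """Draw with replacement n_trials times and return s/n for white balls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    box = ["white"] * n_white + ["black"] * n_black
    successes = sum(1 for _ in range(n_trials) if rng.choice(box) == "white")
    return successes / n_trials

# The relative frequency should drift toward 2/5 = 0.4 as n grows.
for n in (100, 1_000, 10_000, 100_000):
    print(n, simulate_relative_frequency(n))
```

Individual runs will differ with the seed; only the tendency toward 0.4 for large n is the point.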


EXERCISES

1. In a certain experiment of coin-tossing heads appeared 2,048 times in
4,040 throws. What is the relative frequency of heads? of tails?

2. In an experiment in coin tossing, 7 dimes were thrown 128 times with
the following results:

Number of Heads (X)    Frequency
        0                   2
        1                   8
        2                  16
        3                  38
        4                  43
        5                  16
        6                   2
        7                   3
      Total               128

Find the relative frequency of 0 heads; 1 head; etc.
Compute M and σ for this distribution.

3.  Among  10,000  people  aged  30,  85  deaths  occurred  in  a  year.  What 
was  the  relative  frequency  of  deaths? 

4.  Out  of  1,000  children  born  in  a  city  in  a  given  year,  514  were  boys 
and  486  were  girls.  What  is  the  relative  frequency  of  boys  among  the  chil- 
dren that  year? 

5.  As  a  cooperative  exercise  for  the  class,  make  1,000  tosses  of  a  coin 
and  keep  a  record  of  the  number  of  heads  in  (a)  10  trials,  (b)  100  trials, 
(c)  250  trials,  (d)  1,000  trials.  In  each  case  compare  the  observed  relative 
frequency  with  the  expected  relative  frequency,  1/2. 


94.  THEORETICAL  RELATIVE  FREQUENCY: 
A  PRIORI  PROBABILITY 

In certain cases, such as games of chance or drawing balls from a
bag, the probability may be obtained without collecting statistical
data on frequencies. In these cases we make use of certain assumptions
that will give us the probability without actually making the
trials. For example, if a coin is tossed we assume that it is so
constructed and tossed that tails are just as likely to come up as heads,
and hence:

the probability of heads = the probability of tails = 1/2

Similarly, if a bag contains 4 white balls and 6 black balls alike
except as to color, and thoroughly mixed, and a ball is drawn at
random, the probability of drawing a white ball is 4/10 and the
probability of drawing a black ball is 6/10. These illustrations are simple
applications of the following:

Definition. If all the successes and failures can be analyzed into
s + f possible ways, each of which is equally likely, and if s of these
ways give successes and f of them failures, the probability of success in a
single trial is defined as p = s/(s + f) and the probability of failure is
defined as q = f/(s + f).

Example  1.  A  bag  contains  8  black  balls  and  3  white  balls,  and  a  ball  is 
drawn  at  random.  What  is  the  probability  of  drawing  a  white  ball?  a 
black  ball? 

If the drawing of a white ball is counted a success, we have
s = 3, f = 8, s + f = 11, and hence p = 3/11 and q = 8/11.

Returning to the foregoing definition, we may note that if p = 0,
the event in question cannot happen or is impossible. If p = 1, the
event is certain to happen.

Example  2.  From  a  bag  containing  8  white  balls  and  3  black  balls,  5 
balls  are  drawn  at  random.  What  is  the  probability  that  3  are  white  and 
2  are  black? 

The total number of balls in the bag is 11. Hence the number of ways of
selecting 5 balls from 11 balls is 11C5. The 3 white balls can be selected from
8 white balls in 8C3 ways, and the 2 black balls can be selected from 3 black
balls in 3C2 ways. Hence s = 8C3 · 3C2, and the probability of drawing 3
white and 2 black balls is:

$$p = \frac{{}_8C_3 \cdot {}_3C_2}{{}_{11}C_5} = \frac{56 \cdot 3}{462} = \frac{4}{11}$$


Example  3.  If  5  coins  are  tossed,  what  is  the  probability  of  obtaining 
2  heads  and  3  tails? 

Five coins may fall in 2^5 = 32 ways. Two heads may be selected from
the 5 in 5C2 = 10 ways. Hence the probability is 10/32 = 5/16.
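
The three examples of this section follow directly from the definition p = s/(s + f). A brief sketch in Python with exact fractions is given below; the helper name a_priori is illustrative and not from the text.

```python
from fractions import Fraction
from math import comb

def a_priori(s, f):
    """Return (p, q) when s of the s + f equally likely ways are successes."""
    return Fraction(s, s + f), Fraction(f, s + f)

# Example 1: 3 white and 8 black balls, one ball drawn.
p1, q1 = a_priori(3, 8)

# Example 2: 3 white and 2 black in 5 draws from 8 white and 3 black balls.
p2 = Fraction(comb(8, 3) * comb(3, 2), comb(11, 5))

# Example 3: 2 heads and 3 tails when 5 coins are tossed.
p3 = Fraction(comb(5, 2), 2 ** 5)

print(p1, q1)  # 3/11 8/11
print(p2)      # 4/11
print(p3)      # 5/16
```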


EXERCISES

1. If a die is thrown, what is the probability that a six will appear?
that either a five or a six will appear? that a four, five, or six will appear?

2.  If  2  dice  are  thrown,  what  is  the  probability  of  obtaining  a  double 
six? 

3.  If  2  dice  are  thrown,  what  is  the  probability  of  obtaining  a  sum  of 
11?  a  sum  of  7?   What  is  the  most  probable  sum  in  a  throw  of  2  dice? 

4.  A  deck  of  52  cards  is  well  shuffled  and  a  card  is  drawn.  What  is  the 
probability  that  it  is  a  queen?  an  ace  or  a  queen?  a  heart?  a  red  card? 

5. What is the chance of throwing one and only one five in one throw
with 2 dice?

6.  If  2  dice  are  thrown,  what  is  the  chance  of  throwing  at  least  one  five? 

7. If 2 coins are tossed, what is the probability of obtaining 2 heads?
2 tails? 1 head and 1 tail?

8. If 3 coins are tossed, what is the probability of getting 3 heads?
3 tails? 2 heads and 1 tail?

9.  What  is  more  likely  to  happen,  a  throw  of  four  with  1  die  or  a  throw 
of  six  with  2  dice? 

10.  What  is  the  probability  of  throwing  2  sixes  and  1  five  in  a  single 
throw  with  3  dice? 

11.  If  12  men  stand  in  line,  what  is  the  chance  that  A  and  B  are  next  to 
each  other? 


12. From a pack of 52 cards, 3 cards are drawn at random. What is
the  chance  that  they  are  all  clubs? 

13.  Prove: 


95. EXPECTATION

The expected number of occurrences of an event in n trials is defined
as np where p is the probability of occurrence of the event in a single
trial.

Thus, if 100 coins are thrown or if 1 coin is thrown 100 times,
theoretically we expect 50 heads and 50 tails, for n = 100 and

p = q = 1/2

If a die is rolled 36 times, theoretically we expect an ace to turn
up 6 times, for n = 36 and p = 1/6.

If .008 is the probability of death within a year of a man aged 30,
the expected number of deaths within a year among 10,000 men of
this age would be 80, for n = 10,000 and p = .008.

Question: Two coins are thrown 100 times. What is the expected number
of 2 heads? 2 tails? 1 head and 1 tail?

If p is the probability that a person will win a sum of money m,
we define his expectation as pm.

Thus, if a person is to receive $32 in case he tosses 4 coins and they all
fall heads, the value of his expectation is $2, for m = $32 and p = 1/16.

Question: A stake of $24 is made contingent upon getting a sum greater
than 10 in a single throw with 2 dice. What is the value of the expectation?


A. Mutually Exclusive Events. Two or more events are said to be
mutually exclusive when the occurrence of any one of them excludes
the occurrence of any other. Thus, in the toss of a coin the appearance
of heads and the appearance of tails are mutually exclusive. Also,
if a bag contains white and black balls and a ball is drawn, the
drawing of a white ball and the drawing of a black ball are mutually
exclusive events.

Theorem. If p1, p2, ..., pr are the separate probabilities of r
mutually exclusive events, the probability P, that one of these events will
happen in a single trial is the sum of the probabilities of the separate
events. That is:

$$P = p_1 + p_2 + \cdots + p_r \tag{8}$$

By the definition in the preceding section, out of n trials in which
all of the events are in question, the r events are expected to occur
p1n, p2n, ..., prn times respectively. Since only one of these
events can occur on a given trial, it follows that out of n trials one
or another of the r events will occur (p1n + p2n + ... + prn) or
(p1 + p2 + ... + pr)n times. That is, the total probability that
one of the events will occur on a given trial is:

$$P = \frac{(p_1 + p_2 + \cdots + p_r)n}{n} = p_1 + p_2 + \cdots + p_r$$

When two mutually exclusive events are in question, the probabilities
are frequently called either-or probabilities. Thus, if a die is
thrown, the probability of either an ace or a deuce is 1/6 + 1/6 or 1/3.
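
Formula (8) can be checked against a direct count of the equally likely faces of a die. The sketch below is illustrative only; the helper probability and the lambda events are assumptions, not notation from the text.

```python
from fractions import Fraction

faces = range(1, 7)  # the six equally likely faces of a die

def probability(event):
    """Fraction of faces for which event(face) is true."""
    favorable = sum(1 for face in faces if event(face))
    return Fraction(favorable, len(faces))

p_ace = probability(lambda face: face == 1)
p_deuce = probability(lambda face: face == 2)

# Ace and deuce cannot occur together, so the direct count equals the sum.
p_either = probability(lambda face: face in (1, 2))

print(p_ace + p_deuce, p_either)  # 1/3 1/3
```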

B.  Independent  Events.  Two  or  more  events  are  dependent  or 
independent  according  as  the  occurrence  of  any  one  of  them  does  or 
does  not  affect  the  occurrence  of  the  others.  Thus,  if  A  tosses  a 
coin  and  B  throws  a  die,  the  tossing  of  heads  by  A  and  the  throwing 
of  a  deuce  by  B  are  independent  events.  However,  if  a  bag  contains 
a  mixture  of  white  and  black  balls  and  a  ball  is  drawn  and  not  re- 
turned to  the  bag,  the  probabilities  in  a  second  drawing  will  be 
dependent  upon  the  first  event. 

Theorem. If p1, p2, ..., pr are the separate probabilities of r
independent events, the probability P, that they all occur on a given
trial when all of them are in question, is the product of their separate
probabilities. That is:

$$P = p_1 p_2 p_3 \cdots p_r \tag{9}$$

By the definition of the preceding section, out of n trials in which
all of the events are in question, the first event is expected to occur
p1n times. Out of this number, p1n, the second event is expected to
occur p2(p1n) = np1p2 times. That is, both are expected to occur
np1p2 times in n trials. Continuing this process, it is seen that out
of n trials all of the r events are expected to occur np1p2p3 ... pr
times. Hence:

$$P = \frac{n p_1 p_2 p_3 \cdots p_r}{n} = p_1 p_2 p_3 \cdots p_r$$

Example  1.  If  A  tosses  a  coin  and  B  throws  a  die,  what  is  the  probability 
that  A  will  toss  heads  and  B  will  throw  a  deuce? 

The probability that A will toss heads is 1/2 and the probability that B
will throw a deuce is 1/6. Since the two events are independent, the
probability that both events will occur is 1/2 · 1/6 = 1/12.

Example  2.  If  a  coin  is  tossed  3  times,  what  is  the  probability  of  heads 
every  time? 

The probability of heads on any throw is 1/2. Hence for the 3 throws,
since they are independent, P = 1/2 · 1/2 · 1/2 = 1/8.

When two independent events are in question, the probabilities are
frequently called both-and probabilities. Thus in Example 1 if the tossing of
heads by A is event E1 and the throwing a deuce by B is event E2, then the
probability that both E1 and E2 occur is 1/12.

In  Example  1,  what  is  the  probability  that  either  A  will  toss  heads  or  B 
will  throw  a  deuce? 
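
The product rule (9) in Examples 1 and 2 can be confirmed by listing every equally likely joint outcome. This is a minimal sketch; the names are illustrative.

```python
from fractions import Fraction
from itertools import product

coin = ("H", "T")
die = (1, 2, 3, 4, 5, 6)

# Example 1: A tosses a coin and B throws a die (12 joint outcomes).
pairs = list(product(coin, die))
p_heads_and_deuce = Fraction(
    sum(1 for c, d in pairs if c == "H" and d == 2), len(pairs)
)

# Example 2: a coin tossed 3 times (8 joint outcomes).
tosses = list(product(coin, repeat=3))
p_three_heads = Fraction(
    sum(1 for t in tosses if t == ("H", "H", "H")), len(tosses)
)

print(p_heads_and_deuce)  # 1/12
print(p_three_heads)      # 1/8
```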

Corollary. If p1, p2, ..., pr are the separate probabilities of r
independent events, the probability that they will all fail on a given
occasion is

$$(1 - p_1)(1 - p_2) \cdots (1 - p_r) \tag{10}$$

and the probability that the first k events will occur and the remainder
fail is

$$p_1 p_2 \cdots p_k (1 - p_{k+1}) \cdots (1 - p_r) \tag{11}$$

C.  Dependent  Events.  The  following  theorem  for  dependent 
events  may  be  proved  by  an  analogous  method  to  that  used  for 
independent  events. 

Theorem. If the probability of a first of r events is p1, and if, after
this has occurred, the probability of a second event is p2, and if, after
the first and second events have occurred, the probability of a third event
is p3, and so on, then the probability P, that the events will occur in the
order specified is:

$$P = p_1 p_2 p_3 \cdots p_r \tag{12}$$

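
As an illustration of (12), suppose two balls are drawn without replacement from a box of 2 white and 3 black balls and we ask for the probability that both are white. The box and the question are assumptions chosen for this sketch, not an example worked in the text.

```python
from fractions import Fraction
from itertools import permutations

box = ["W", "W", "B", "B", "B"]

# Formula (12): p1 for the first draw, then p2 given that a white ball is gone.
p_first_white = Fraction(2, 5)
p_second_white_given_first = Fraction(1, 4)
p_formula = p_first_white * p_second_white_given_first

# Check by listing all ordered draws of two distinct balls.
ordered_draws = list(permutations(range(len(box)), 2))
favorable = sum(1 for i, j in ordered_draws if box[i] == "W" and box[j] == "W")
p_enumerated = Fraction(favorable, len(ordered_draws))

print(p_formula, p_enumerated)  # 1/10 1/10
```
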
EXERCISES 

1.  If  5  balls  are  drawn  from  a  bag  containing  6  red  and  9  white  balls, 
what  is  the  probability:  (a)  that  all  will  be  red?  (b)  that  3  will  be  red  and 
2  white? 
2. A draws 3 cards from a well-shuffled pack and simultaneously B
tosses a coin. What is the probability of 3 aces and 1 head?

3.  If  4  coins  are  tossed,  what  is  the  probability  that  all  will  fall  tails? 

4.  A,  B,  and  C  go  bird-hunting.  A  has  a  record  of  1  bird  out  of  2,  B  gets 
2  out  of  3,  and  C  gets  3  out  of  4.  What  is  the  probability  that  they  will 
kill  a  bird  at  which  all  shoot  simultaneously?  Hint:  What  is  the  proba- 
bility that  all  3  miss? 

5.  If  the  probability  that  A  will  die  within  a  year  is  i%  and  the  proba- 
bility that  B  will  die  within  a  year  is  -j^^,  what  is  the  probability  that: 
(a)  both  A  and  B  will  die  within  a  year?  (b)  both  A  and  B  will  live  a  year? 
(c)  one  life  will  fail  within  a  year? 

6.  The  probability  that  A  will  solve  a  problem  is  J  and  that  B  will  solve 
it  is  f .  What  is  the  probability  that  if  A  and  B  try  the  problem  it  will  be 
solved? 

7.  In  a  single  throw  of  2  dice  what  is  the  chance  that  neither  doublets 
nor  seven  will  appear? 

97.  REPEATED  TRIALS 

As we proceed into the text the observing student will be amazed
at the importance of the theory of the probability of repeated trials
in the theory of statistics. This is, of course, due primarily to the
fact that much of statistical data is a kind of repeated measurement.

In  order  to  familiarize  ourselves  with  the  method  of  proof  of  the 
general  theorem  of  this  section,  let  us  consider  something  simple. 

Example.  What  is  the  probability  of  throwing  2  aces  in  4  throws  of  a  die? 

The conditions of the problem are met if in the first 2 throws we obtain
aces and in the next 2 throws not-aces; or if in the first throw we get ace,
the second throw not-ace, the third throw ace, and the fourth throw not-ace;
and so on. We shall illustrate the possibilities symbolically as follows:

A1 A2 - -,   A1 - A3 -,   A1 - - A4,   - A2 A3 -,   - A2 - A4,   - - A3 A4

Considering the first case, the probability of throwing an ace on any
throw is 1/6. The probability of not throwing an ace on any throw is 5/6.
Hence the probability of throwing an ace on the first and second throws
and not throwing an ace on the two remaining throws is (1/6)^2 (5/6)^2.

In the second case, the probability of events occurring as the symbol
above indicates is (1/6)(5/6)(1/6)(5/6) = (1/6)^2 (5/6)^2.

The remaining cases may be treated in a similar manner, and in each
instance the result for any specified set is (1/6)^2 (5/6)^2. Now it is evident that
the 2 throws showing aces can be selected from the 4 throws in 4C2 = 6 ways. Since
the 6 cases are mutually exclusive, the chance that one or the other of the
specified cases occurs is 6(1/6)^2 (5/6)^2 = 150/1296 = 25/216.
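
The result of this example can be confirmed by listing all 6^4 = 1296 equally likely outcomes of 4 throws and counting those with exactly 2 aces. The sketch below is a check by brute force, with illustrative names.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Every possible record of 4 throws of a die.
throws = list(product(range(1, 7), repeat=4))

# Outcomes in which the ace (face 1) appears exactly twice.
with_two_aces = sum(1 for t in throws if t.count(1) == 2)
p_enumerated = Fraction(with_two_aces, len(throws))

# The same probability from the argument in the text.
p_formula = comb(4, 2) * Fraction(1, 6) ** 2 * Fraction(5, 6) ** 2

print(with_two_aces, len(throws))  # 150 1296
print(p_enumerated, p_formula)     # 25/216 25/216
```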

Let us now consider an important theorem.

Theorem of Repeated Trials. If p is the probability of the success
of an event in a single trial and q is the probability of its failure
(p + q = 1), then the probability P_r that the event will succeed exactly
r times in n trials is:¹

$$P_r = {}_nC_r \, p^r q^{n-r} \tag{13}$$

For the probability that the event will succeed in each of r specified
trials and will fail in the remaining (n - r) trials is, by (11), p^r q^(n-r).
Further, it is possible for the r successes to occur out of n trials in
nCr different ways. These ways being mutually exclusive, by (8) the
probability in question is P_r = nCr p^r q^(n-r).
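
Formula (13) translates directly into a short function. The following is a sketch with exact fractions; the function name repeated_trials is an assumption made for illustration. It is applied here to the die example above (n = 4, p = 1/6).

```python
from fractions import Fraction
from math import comb

def repeated_trials(n, r, p):
    """P_r = nCr * p^r * q^(n-r): exactly r successes in n trials."""
    p = Fraction(p)
    q = 1 - p
    return comb(n, r) * p ** r * q ** (n - r)

# Probabilities of 0, 1, 2, 3, 4 aces in 4 throws of a die.
distribution = [repeated_trials(4, r, Fraction(1, 6)) for r in range(5)]

print(distribution[2])    # 25/216
print(sum(distribution))  # 1
```

The sum of all the P_r is (p + q)^n = 1, which gives a quick check on any table of such values.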

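The various probabilities are indicated in the following table:

Table 87. Values of P_r for Various Values of r

  r        P_r                   The Probability That in n Trials
                                 There Will Be
  n        p^n                   n successes, 0 failures
  n - 1    nC1 p^(n-1) q         n - 1 successes, 1 failure
  n - 2    nC2 p^(n-2) q^2       n - 2 successes, 2 failures
  ...
  n - r    nCr p^(n-r) q^r       n - r successes, r failures
  ...
  r        nCr p^r q^(n-r)       r successes, n - r failures
  ...
  2        nC2 p^2 q^(n-2)       2 successes, n - 2 failures
  1        nC1 p q^(n-1)         1 success, n - 1 failures
  0        q^n                   0 successes, n failures
  Total    (p + q)^n = 1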

From  Table  87  we  have  at  once  the  following: 

Corollary. The probability that an event will succeed at least r times
in n trials is P_r + P_(r+1) + · · · + P_n, that is:

$$\sum_{r}^{n} P_r = p^n + {}_nC_1 p^{n-1} q + {}_nC_2 p^{n-2} q^2 + \cdots + {}_nC_r p^r q^{n-r} \tag{14}$$

It will be noted that (14) consists of the first (n - r + 1) terms of
the expansion (p + q)^n.

¹ It will be noted that (13) is the (n - r + 1)th term of the expansion (p + q)^n
and the (r + 1)th term of the expansion (q + p)^n.

Example. An urn contains 12 white and 24 black balls. What is the
probability that, in 10 drawings with replacements, exactly 6 white balls
are drawn?

We have:

$$p = \frac{12}{36} = \frac{1}{3}, \qquad q = \frac{24}{36} = \frac{2}{3},$$

n = 10, r = 6, n - r = 4,

hence:

$$P_6 = {}_{10}C_6 \left(\frac{1}{3}\right)^6 \left(\frac{2}{3}\right)^4 = \frac{3360}{3^{10}}$$
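
The urn example, together with the "at least r" sum of (14), can be evaluated exactly. This is a sketch under the same assumptions as the example (p = 1/3, n = 10); the function names exactly and at_least are illustrative.

```python
from fractions import Fraction
from math import comb

def exactly(n, r, p):
    """P_r of formula (13)."""
    p = Fraction(p)
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def at_least(n, r, p):
    """P_r + P_(r+1) + ... + P_n, as in formula (14)."""
    return sum(exactly(n, k, p) for k in range(r, n + 1))

p_white = Fraction(12, 36)

p_exactly_six = exactly(10, 6, p_white)
p_six_or_more = at_least(10, 6, p_white)

print(p_exactly_six == Fraction(3360, 3 ** 10))  # True
print(float(p_exactly_six))
print(float(p_six_or_more))
```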

Since the computation of P_r in (13) involves the computation of

$${}_nC_r = \frac{n!}{r!\,(n - r)!}$$

we may naturally wonder what can be done when n and r are so large that
the labor of evaluating n!, r!, and (n - r)! becomes tedious if not
prohibitive. At present we can recommend two alternatives. If tables of
the logarithms of factorial n are at hand,¹ then P_r can be conveniently
computed by logarithms. If such tables are not at hand, approximate
results can be found by applying Stirling's formula, namely:

$$n! \approx n^n e^{-n} \sqrt{2\pi n} \tag{15}$$

The derivation of this formula depends upon the calculus and is
therefore beyond the scope of this text.² For large values of n, it
gives satisfactory results.

Consider the following:

Example.  An  urn  contains  2  white  and  3  black  balls.  What  is  the 
probability  that,  in  500  drawings  with  replacements,  exactly  200  white 
balls  will  be  drawn? 


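Solution:

$$n = 500, \qquad p = \frac{2}{5}, \qquad q = \frac{3}{5}$$

$$P_{200} = {}_{500}C_{200} \left(\frac{2}{5}\right)^{200} \left(\frac{3}{5}\right)^{300} = \frac{500!}{200!\,300!} \left(\frac{2}{5}\right)^{200} \left(\frac{3}{5}\right)^{300}$$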

¹ An excellent set of tables is J. W. Glover, Tables of Applied Mathematics,
George Wahr, Ann Arbor, Michigan, 1923.

² See J. L. Coolidge, An Introduction to Mathematical Probability, p. 38, for a
derivation.
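
Applying Stirling's formula:

$$P_{200} \approx \frac{500^{500} e^{-500} \sqrt{2\pi \cdot 500}}{200^{200} e^{-200} \sqrt{2\pi \cdot 200} \cdot 300^{300} e^{-300} \sqrt{2\pi \cdot 300}} \left(\frac{2}{5}\right)^{200} \left(\frac{3}{5}\right)^{300}$$

$$= \frac{5^{500} \, 100^{500} \sqrt{5}}{2^{200} \, 100^{200} \, 3^{300} \, 100^{300} \cdot 10\sqrt{12\pi}} \cdot \frac{2^{200} \, 3^{300}}{5^{500}}$$

$$= \frac{\sqrt{5}}{10\sqrt{12\pi}} = .036.$$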

If  Glover's  Tables  are  used  with  logarithms  the  result  is: 

P200  =  .041 
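
With exact integer arithmetic the labor described above disappears, so the Stirling result can be compared with the exact value of P_200. The sketch below is one way to make that comparison; the function name log_stirling_factorial is an assumption for illustration. In this calculation the exact value comes out close to .0364, so it agrees with the Stirling figure to three decimals and differs from the table-based figure quoted above in the second decimal place.

```python
from fractions import Fraction
from math import comb, exp, log, pi, sqrt

def log_stirling_factorial(n):
    """Natural log of Stirling's approximation n^n e^(-n) sqrt(2 pi n)."""
    return n * log(n) - n + 0.5 * log(2 * pi * n)

n, r = 500, 200
p, q = Fraction(2, 5), Fraction(3, 5)

# Exact value of formula (13), converted to a float only at the end.
exact = float(comb(n, r) * p ** r * q ** (n - r))

# The same quantity with each factorial replaced by formula (15),
# computed through logarithms to avoid overflow.
log_approx = (
    log_stirling_factorial(n)
    - log_stirling_factorial(r)
    - log_stirling_factorial(n - r)
    + r * log(2 / 5)
    + (n - r) * log(3 / 5)
)
approx = exp(log_approx)

# Closed form reached in the text after cancellation.
closed_form = sqrt(5) / (10 * sqrt(12 * pi))

print(exact, approx, closed_form)
```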

EXERCISES 

1.  A  coin  is  tossed  7  times,  or  7  coins  are  tossed  one  time.  Find  the 
probability  of  exactly:  (a)  no  heads,  (b)  1  head,  (c)  2  heads,  etc.  to  7  heads. 

2. Seven coins are tossed 128 times. Using the Definition in Section 95
(p. 374), and the probabilities of the last exercise (1), find the expected
number of times of 0 heads, 1 head, 2 heads, etc. to 7 heads. Compare the
results with those of Exercise 2 (p. 371).

3.  If  a  die  is  thrown  6  times  or  if  6  dice  are  thrown  1  time,  what  is  the 
probability  of  obtaining:  (a)  exactly  2  aces?  (b)  at  least  3  aces? 

4.  Find  the  probability  of  throwing  with  a  single  die  a  deuce  at  least 
once  in  5  trials. 

5. Prove that the probability that an event will succeed at least once
in n trials is (1 - q^n).

6. In tossing 10 coins, what is the probability of obtaining at least
8 heads?

7.  A  man  whose  batting  average  is  will  bat  4  times  in  a  game.  What 
is  the  probability  that  he  will  get  (a)  exactly  2  hits?  (b)  at  least  2  hits? 

8.  According  to  the  American  Experience  Table  of  Mortality,  out  of 
100,000  persons  living  at  the  age  of  10  years,  91,914  are  living  at  the  age 
of  21  years.  Each  of  7  boys  is  now  10  years  old.  What  is  the  probability 
that  exactly  5  of  them  will  live  to  be  21? 

9.  A  bag  contains  4  white  and  2  black  balls.  Five  balls  are  drawn  with 
replacements.  What  is  the  probability:  (a)  that  exactly  3  are  white? 
(b)  that  at  least  3  are  black? 

10.  What  is  the  probability  of  throwing  at  least  3  sevens  in  5  throws 
with  a  pair  of  dice? 

11.  How  many  throws  with  2  dice  will  be  required  in  order  that  the 
probability  of  obtaining  a  double  six  at  least  once  will  have  the  value  §? 

Hint:  If  J  =  1  ~  (J^)n,  find  n. 

12. At an old men's home are 5 seventy-year old men. Find the probability
that (a) exactly 2 of them will die within a year, (b) that a specified
2 of them will die within a year, (c) that at least 2 of them will die within
a year. The probability that a man aged 70 lives a year is p70 = 0.94.
Hence q70 = 0.06.

13. Hospital records show that 5 per cent of cases of a certain disease
are fatal. Five patients are admitted with this disease. Find the
probability (a) that all will recover, (b) that exactly 3 will die, (c) that at least
3 will die.

14.  A  marksman  is  able,  on  the  average,  to  hit  a  target  950  times  out 
of  1,000.  Find  the  probability  that  he  will  obtain  (a)  exactly  9  hits  in 
10  shots,  (b)  exactly  10  hits  in  10  shots,  (c)  either  9  or  10  hits  in  10  shots, 
(d)  at  least  5  hits  in  10  shots.   Express  symbolically. 

15.  The  registrar's  records  show  that  10  per  cent  of  the  students  fail 
a  certain  course.  The  present  enrolment  in  the  course  is  25.  What  is  the 
probability  that  5  will  fail? 

16.  In  the  long  run  3  vessels  out  of  every  100  are  sunk.  If  10  vessels 
are  out,  what  is  the  probability  (a)  that  exactly  6  will  arrive  safely? 
(b)  that  at  least  6  will  arrive  safely?   Express  symbolically. 

17.  A  batch  of  1,000  electric  bulbs  was  tested  and  found  to  be  5  per 
cent  bad.  If  another  batch  of  100  lamps  is  manufactured  under  similar 
conditions,  what  is  the  probability  that  not  more  than  10  per  cent  will  be 
defective?   Give  the  result  symbolically. 

18.  The  American  Experience  Mortality  Table  states  that  for  an  indi- 
vidual aged  25  the  probability  of  survival  another  year  is  p  =  0.992. 
What  probabilities  are  expressed  by  the  following: 

(a) ${}_{1000}C_{200} \, (.992)^{800} (.008)^{200}$?

(b) $\sum_{r=700}^{900} {}_{1000}C_r \, (.992)^{1000-r} (.008)^r$?

19. A, B, and C are three marksmen. A's record is 4 hits in 5 shots,
B's record is 3 hits in 4 shots, and C's record is 2 hits in 3 shots. They fire
simultaneously. What is the probability that at least 2 shots hit?

20.  Of  7  dates  picked  at  random,  what  is  the  probability  that  (a)  exactly 
5  are  Sundays,  (b)  at  least  5  are  Sundays,  (c)  the  first  5  but  no  others  are 
Sundays? 

21. A can hit a target 4 times in 5 shots; B, three times in four shots.
They fire a volley. What is the probability (a) that at least two shots hit?
(b) that at least one shot hits?

22. A student takes a true-false test consisting of 10 questions and
guesses at the answers. Assuming he is equally likely to answer incorrectly
as correctly on each question, find the probability (a) that he will answer
all the questions correctly, (b) that he will answer half of them correctly,
(c) that he will answer 80 per cent or more of them correctly.

23.  In  the  long  run  a  child  under  one  year  of  age  who  is  attacked  by 
whooping  cough  has  about  a  fifty-fifty  chance  of  recovery.  If  10  children 
under  one  year  of  age  are  attacked  by  this  disease, 
(a) what is the expected number of deaths?

(b)  what  is  the  probability  that  the  expected  number  will  die? 

(c)  what  is  the  probability  that  8  or  more  recover? 

24. In how many throws with a single die will it be an even chance that
"1" turns up at least once?

25. If 2 dice are thrown, what is the probability of obtaining a total
of 7?

Hint: The number of ways of obtaining a total of 7 is the coefficient of
x^7 in (x + x^2 + x^3 + x^4 + x^5 + x^6)^2, or of x^5 in
(1 + x + x^2 + x^3 + x^4 + x^5)^2, or of x^5 in
((1 - x^6)/(1 - x))^2 = (1 - x^6)^2 (1 - x)^(-2).

26.  If  three  dice  are  thrown  what  is  the  probability  of  obtaining  a 
total  of  10? 

Hint: The number of ways of obtaining a total of 10 is the coefficient
of x^10 in (x + x^2 + x^3 + x^4 + x^5 + x^6)^3.

27.  If  three  dice  are  thrown  what  is  the  probability  of  obtaining  a 
total  of  8? 

28.  Three  dice  are  thrown.  Show  that  the  probability  of  obtaining  a 
total  of  4  is  equal  to  the  probability  of  obtaining  a  total  of  17. 


Chapter  12 

THE  POINT  BINOMIAL  AND  THE  NORMAL  CURVE 


98.  INTRODUCTION 

In  the  preceding  chapter  considerable  emphasis  was  placed  upon 
what  is  essentially  the 

Theorem. If p is the probability of the success of an event in a
single trial and q is the probability of its failure (p + q = 1), then the
successive terms of the binomial expansion

$$(q + p)^n = q^n + {}_nC_1 q^{n-1} p + {}_nC_2 q^{n-2} p^2 + \cdots + {}_nC_X q^{n-X} p^X + \cdots + p^n \tag{1}$$

give the respective probabilities that in n trials the event will succeed in
0, 1, 2, ..., X, ..., n times.

It should be especially noted that the general term

$$P_X = {}_nC_X \, q^{n-X} p^X$$

gives the probability that the event will succeed exactly X times in
n trials.

Figure 51

Example 1. If a coin is tossed 10 times (or if 10 coins are tossed 1 time),
the successive terms of the expansion

$$\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^{10} = \tfrac{1}{1024}\,[1 + 10 + 45 + 120 + 210 + 252 + 210 + 120 + 45 + 10 + 1]$$

give the probabilities of 0 heads; 1 head, 9 tails; 2 heads, 8 tails; etc.

If the terms of this expansion (1/2 + 1/2)^10 be plotted as ordinates at unit
distances along the horizontal axis, it will be noted that the points are
symmetrically distributed about the vertical through X = 5 (Fig. 51).
It will be shown later that the symmetry is due to the fact that p = q = 1/2.
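
The terms of Example 1 and their symmetry about X = 5 can be generated in a few lines. This is a minimal sketch; the list names are illustrative.

```python
from fractions import Fraction
from math import comb

n = 10

# Bracketed coefficients 10C0, 10C1, ..., 10C10 of the expansion.
coefficients = [comb(n, x) for x in range(n + 1)]

# Probabilities of 0, 1, ..., 10 heads when p = q = 1/2.
probabilities = [Fraction(c, 2 ** n) for c in coefficients]

print(coefficients)
# [1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1]
print(probabilities == probabilities[::-1])  # True: symmetric about X = 5
```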

Example 2. Nine balls are drawn singly, with replacements, from a bag
containing white and black balls in the ratio of 2 to 1. If the drawing of a
white ball is counted a success, p = 2/3, q = 1/3, and the successive
terms of the expansion

$$\left(\tfrac{1}{3} + \tfrac{2}{3}\right)^9 = \tfrac{1}{19683}\,[1 + 18 + 144 + 672 + 2016 + 4032 + 5376 + 4608 + 2304 + 512]$$

give the probabilities of drawing 0 white balls; 1 white, 8 black balls;
2 white, 7 black balls, etc.

If  these  probabilities  be  plotted  as  ordinates,  as  the  figure  below  indicates, 
it  is  noted  that  the  points  are  not  symmetrically  distributed.  That  the 
skewness  here  is  due  to  the  inequality  of  p  and  q  will  be  shown  in  the 
succeeding  section. 
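The two expansions above can be checked mechanically. The following is a minimal sketch in Python, not part of the original text; the helper name `point_binomial` is a hypothetical label chosen for illustration. It evaluates the general term $P_X = {}_nC_X\,q^{n-X}p^X$ with exact fractions.

```python
from fractions import Fraction
from math import comb


def point_binomial(n, p):
    """Return [P_0, ..., P_n], the successive terms of (q + p)^n, as exact fractions."""
    p = Fraction(p)
    q = 1 - p
    return [comb(n, x) * q ** (n - x) * p ** x for x in range(n + 1)]


# Example 1: ten coins, p = q = 1/2; numerators over the common denominator 1024.
coins = point_binomial(10, Fraction(1, 2))
coin_numerators = [int(term * 2 ** 10) for term in coins]

# Example 2: nine draws, p = 2/3; numerators over the common denominator 3^9 = 19683.
balls = point_binomial(9, Fraction(2, 3))
ball_numerators = [int(term * 3 ** 9) for term in balls]

print(coin_numerators)
print(ball_numerators)
```

The first list is symmetrical about its middle term and the second is not, which is the contrast the two examples are meant to show.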


Figure 52

99. CHARACTERISTICS OF THE POINT BINOMIAL

It has been observed in the preceding section that the binomial
distributions possess certain geometrical similarities to the observed
distributions studied in Chapters 3, 4, and 5; namely, they are
relatively low at the extremes and rise to a single mode near the
center. These similarities are so striking that we shall use the
binomial distribution as a theoretical or approximate distribution
with which to compare distributions of observed data. That is,
we shall use the point binomial as the first approximation to
distributions of observed data.

We shall need to characterize the binomial distribution, as we have
other distributions, by computing measures of central tendency,
dispersion, skewness, et cetera. Having computed these constants
for the theoretical distribution, we shall apply the results to
distributions of observed data for purposes of comparison and
generalization.

A. The Mode. Since the sum of the terms of $(q+p)^n$ equals
unity and the extreme terms are usually smaller than those near the
center, it would seem that for a determinate value of X, say $X = \alpha$,
$P_X$ will have a maximum.
In order for $P_X$ to be a maximum for $X = \alpha$, we must have

a. $P_{\alpha-1} \le P_\alpha$

b. $P_\alpha \ge P_{\alpha+1}$

that is, we must have

a. ${}_nC_{\alpha-1}\,q^{n-\alpha+1}p^{\alpha-1} \le {}_nC_\alpha\,q^{n-\alpha}p^{\alpha}$

b. ${}_nC_\alpha\,q^{n-\alpha}p^{\alpha} \ge {}_nC_{\alpha+1}\,q^{n-\alpha-1}p^{\alpha+1}$

Figure 53

Using the relation

$${}_nC_\alpha = \frac{n!}{\alpha!\,(n-\alpha)!}$$

the first inequality reduces to

$$\alpha \le np + p = (n+1)p$$

and the second reduces to:

$$\alpha \ge np - q$$

That is, $\alpha$ satisfies the double inequality:

$$np - q \le \alpha \le np + p \qquad (2)$$

Diagram 13. The interval from $np - q$ to $np + p$ on the number line; it
has unit length and contains $np$.


If $np + p$ is an integer, so is $np - q$, the next lower integer. In
this case two values of $\alpha$ satisfy (2) since $\alpha$ is necessarily integral.
They are $\alpha = np + p$ and $\alpha = np - q$. [See Exercise 6, p. 390.] Thus:

$$\left(\tfrac23+\tfrac13\right)^{5} = \tfrac{1}{243}\,[32 + 80 + 80 + 40 + 10 + 1]$$

has two equal terms which are larger than any other terms, one at
$\alpha = \tfrac53 + \tfrac13 = 2$, and the other at $\alpha = \tfrac53 - \tfrac23 = 1$. Recalling that
$P_\alpha$ is the $(\alpha + 1)$th term of (1), the second and the third terms are
two equal terms which are larger than any other terms. Similarly,

$$\left(\tfrac13+\tfrac23\right)^{5} = \tfrac{1}{243}\,[1 + 10 + 40 + 80 + 80 + 32]$$

has two equal terms which are larger than any other terms, since
$np + p = 5(\tfrac23) + \tfrac23 = 4$ is an integer. They are the fourth and the fifth
terms.

If $np + p$ or $(n + 1)p$ is fractional, so is $np - q$ since
$np - q = np - (1 - p) = (n + 1)p - 1$. By (2), $\alpha$ must be an integer lying
between them. Since there is only one such integer, it must be $\alpha$.
Thus $(\tfrac13+\tfrac23)^6$ has only one maximum term. For in this case
$np + p = 6(\tfrac23) + \tfrac23 = 4\tfrac23$, and $np - q = 3\tfrac23$. Hence $\alpha = 4$, and the fifth term
is the maximum term. The entire expansion is:

$$\left(\tfrac13+\tfrac23\right)^{6} = \tfrac{1}{729}\,[1 + 12 + 60 + 160 + 240 + 192 + 64]$$

We may summarize these results into the following:

Theorem. If $np + p$ or $np - q$ is fractional, $P_X$ has one maximum
term for X equal to the greatest integer in $np + p$. If $np + p$ or $np - q$
is integral, $P_X$ has two equal terms which are larger than any other terms.
They occur when X equals $np + p$ and $np - q$.
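The theorem on the mode can be compared against a direct search for the largest term. The sketch below is illustrative only and is not from the original text; the function names `modes_by_search` and `modes_by_theorem` are hypothetical, and p is assumed to lie strictly between 0 and 1. Exact fractions are used so that two equal terms are recognized as equal.

```python
from fractions import Fraction
from math import comb, floor


def terms(n, p):
    """Successive terms P_0, ..., P_n of (q + p)^n as exact fractions."""
    q = 1 - p
    return [comb(n, x) * q ** (n - x) * p ** x for x in range(n + 1)]


def modes_by_search(n, p):
    """Every X at which P_X attains its largest value."""
    values = terms(n, p)
    peak = max(values)
    return [x for x, v in enumerate(values) if v == peak]


def modes_by_theorem(n, p):
    """Mode(s) predicted by the theorem: inspect np + p = (n + 1)p."""
    upper = (n + 1) * p
    if upper.denominator == 1:           # np + p integral: two equal maxima
        return [int(upper) - 1, int(upper)]
    return [floor(upper)]                # otherwise the greatest integer in np + p


for n, p in [(5, Fraction(1, 3)), (5, Fraction(2, 3)), (6, Fraction(2, 3))]:
    print(n, p, modes_by_search(n, p), modes_by_theorem(n, p))
```

Note that the lists hold values of X, so X = 4 corresponds to the fifth term of the expansion.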

If n is large and np relatively large when compared with p and q,
np closely approximates $np + p$ and $np - q$. In this case we call
np the expected number of successes and $n - np = n(1 - p) = nq$
the expected number of failures of the event in n trials.

The probability of np successes is given by $P_{np} = {}_nC_{np}\,p^{np}q^{nq}$
which, upon applying Stirling's formula (p. 379), reduces to

$$\frac{1}{\sqrt{2\pi npq}},$$

a very small number when n is large. That is, obtaining exactly the
expected number of successes (or failures) is a very improbable event.
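To see how quickly the probability of exactly np successes shrinks, one can compare the binomial term with the Stirling value $1/\sqrt{2\pi npq}$. This is a rough numerical sketch, not part of the original text; the function names are hypothetical, np is assumed to be a whole number, and the binomial term is evaluated through logarithms (`math.lgamma`) only to avoid overflow for large n.

```python
from math import exp, lgamma, log, pi, sqrt


def prob_of_expected(n, p):
    """P_np for the point binomial, via logarithms; assumes n*p is a whole number."""
    q = 1.0 - p
    k = round(n * p)
    log_term = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                + k * log(p) + (n - k) * log(q))
    return exp(log_term)


def stirling_value(n, p):
    """The approximation 1 / sqrt(2*pi*n*p*q) quoted in the text."""
    return 1.0 / sqrt(2.0 * pi * n * p * (1.0 - p))


for n in (10, 100, 1000, 10000):
    print(n, prob_of_expected(n, 0.5), stirling_value(n, 0.5))
```

Both columns fall toward zero as n grows, and their ratio approaches 1.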

B. The Mean, the Dispersion, the Skewness. The computation
of M, $\sigma$, $\alpha_3$, and $\alpha_4$ is greatly facilitated by the preparation of Table 88
in which $f(X) = {}_nC_X\,q^{n-X}p^X$ indicates the ordinate corresponding to
the given abscissa, X.


Table 88

X (1) | $f(X) = {}_nC_X\,q^{n-X}p^X$ (2) | $X f(X)$ (3) | $X(X-1) f(X)$ (4) | $X(X-1)(X-2) f(X)$ (5)
0 | $q^n$ | 0 | 0 | 0
1 | $nq^{n-1}p$ | $nq^{n-1}p$ | 0 | 0
2 | $\frac{n(n-1)}{1\cdot 2}q^{n-2}p^2$ | $n(n-1)q^{n-2}p^2$ | $n(n-1)q^{n-2}p^2$ | 0
3 | $\frac{n(n-1)(n-2)}{1\cdot 2\cdot 3}q^{n-3}p^3$ | $\frac{n(n-1)(n-2)}{1\cdot 2}q^{n-3}p^3$ | $n(n-1)(n-2)q^{n-3}p^3$ | $n(n-1)(n-2)q^{n-3}p^3$
... | ... | ... | ... | ...
n-1 | $nqp^{n-1}$ | $n(n-1)qp^{n-1}$ | $n(n-1)(n-2)qp^{n-1}$ | $n(n-1)(n-2)(n-3)qp^{n-1}$
n | $p^n$ | $np^n$ | $n(n-1)p^n$ | $n(n-1)(n-2)p^n$
Total | $(q+p)^n = 1$ | $np(q+p)^{n-1} = np$ | $n(n-1)p^2(q+p)^{n-2} = n(n-1)p^2$ | $n(n-1)(n-2)p^3(q+p)^{n-3} = n(n-1)(n-2)p^3$

The total of column (2) of the table, $\sum f(X)$, is obviously unity since:

$$\sum f(X) = q^n + nq^{n-1}p + \frac{n(n-1)}{1\cdot 2}q^{n-2}p^2 + \cdots + p^n = (q+p)^n = 1$$

The total of column (3), $\sum X f(X)$, is easily recognized if one takes the
common factor, np, out of every term. Thus:

$$\sum X f(X) = np\left[q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1\cdot 2}q^{n-3}p^2 + \cdots + p^{n-1}\right] = np(q+p)^{n-1} = np$$

Likewise columns (4) and (5) may be reduced to:

$$\sum X(X-1) f(X) = n(n-1)p^2(q+p)^{n-2} = n(n-1)p^2$$

$$\sum X(X-1)(X-2) f(X) = n(n-1)(n-2)p^3(q+p)^{n-3} = n(n-1)(n-2)p^3$$

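We therefore have:

$$M = \frac{\sum X f(X)}{\sum f(X)} = np$$

Since

$$\sum X(X-1) f(X) = \sum X^2 f(X) - \sum X f(X) = n(n-1)p^2$$

(from Table 88), we have:

$$\sum X^2 f(X) = n(n-1)p^2 + \sum X f(X) = n(n-1)p^2 + np$$

$$= n^2p^2 + np(1-p) = n^2p^2 + npq = M^2 + npq$$

But:

$$\sigma^2 = \frac{\sum X^2 f(X)}{\sum f(X)} - M^2$$

Hence:

$$\sigma^2 = npq$$

or

$$\sigma = \sqrt{npq}$$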

Similarly:

$$\sum X(X-1)(X-2) f(X) = \sum X^3 f(X) - 3\sum X^2 f(X) + 2\sum X f(X)$$

$$= \sum X^3 f(X) - 3(n^2p^2 + npq) + 2np$$

$$= \sum X^3 f(X) - 3n^2p^2 - 3npq + 2np$$

$$= n(n-1)(n-2)p^3 \quad \text{(from Table 88)}$$

Therefore:

$$\sum X^3 f(X) = n^3p^3 + 3n^2p^2(1-p) + 3npq + 2np^3 - 2np$$

$$= n^3p^3 + 3n^2p^2q + 3npq + 2np^3 - 2np$$

Using the formula for $\nu_3$ given on page 162 for the case of unit class
width, origin at zero, and $N = \sum f(X) = 1$, we have:

$$\nu_3 = \sum X^3 f(X) - 3M\sum X^2 f(X) + 2M^3$$

Substituting the values given above:

$$\nu_3 = 3npq + 2np^3 - 2np = np(3q + 2p^2 - 2)$$

$$= np(3 - 3p + 2p^2 - 2) = np(1-p)(1-2p)$$



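and finally

$$\nu_3 \text{ or } \mu_3 = npq(1-2p) = npq(q-p)$$

Hence:

$$\alpha_3 = \frac{\mu_3}{\sigma^3} = \frac{npq(q-p)}{(npq)^{3/2}} = \frac{q-p}{\sqrt{npq}}$$

Collecting these results we have

$$M = np, \qquad \sigma = \sqrt{npq}, \qquad \alpha_3 = \frac{q-p}{\sigma} \qquad (3)$$

where the positive direction is that of increasing X.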

The equation M = np shows that for the point binomial, $(q+p)^n$,
the mean value is equal to the expected value. The value of $\alpha_3$
shows that the skewness is positive when p is less than q, is negative
when p is greater than q, and is zero when p equals q.
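The results M = np, $\sigma^2 = npq$, and $\mu_3 = npq(q-p)$ can be confirmed for any particular binomial by summing over the terms directly instead of using the factorial sums of Table 88. The sketch below is an illustration only, not the authors' procedure; the function name `central_moments` is hypothetical, and the binomial of Example 2 (n = 9, p = 2/3) is used as the test case.

```python
from fractions import Fraction
from math import comb, sqrt


def central_moments(n, p):
    """Return (M, mu_2, mu_3) of the point binomial (q + p)^n by direct summation."""
    q = 1 - p
    f = [comb(n, x) * q ** (n - x) * p ** x for x in range(n + 1)]
    mean = sum(x * fx for x, fx in enumerate(f))
    mu2 = sum((x - mean) ** 2 * fx for x, fx in enumerate(f))
    mu3 = sum((x - mean) ** 3 * fx for x, fx in enumerate(f))
    return mean, mu2, mu3


n, p = 9, Fraction(2, 3)
q = 1 - p
mean, mu2, mu3 = central_moments(n, p)

# Compare with the closed forms collected in (3).
print(mean, n * p)                      # mean against np
print(mu2, n * p * q)                   # variance against npq
print(mu3, n * p * q * (q - p))         # third moment against npq(q - p)

alpha3 = float(mu3) / float(mu2) ** 1.5
print(alpha3, float(q - p) / sqrt(float(n * p * q)))
```

Because p exceeds q here, the computed skewness is negative, in agreement with the rule stated above.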

In the next list of exercises we ask the student to show that when
n becomes infinite in the point binomial $(q+p)^n$, the skewness
approaches zero, and the kurtosis $(\alpha_4 - 3)$ also approaches zero.
We have stated in Chapter 5 that, for a normal distribution, $\alpha_3$
equals zero and $\alpha_4$ equals 3. Thus we see that as n increases the
moments (of order less than 5) of the point binomial approach the
same moments of the normal distribution.
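The approach of $\alpha_3$ to zero and of $\alpha_4$ to 3 can be tabulated for any p. The following sketch is not from the original text; it assumes $\alpha_3 = (q-p)/\sqrt{npq}$ from (3) and the expression $\alpha_4 = 3 + 1/\sigma^2 - 6/n$ that Exercise 4 of the next list asks the student to establish. The function names are hypothetical, and printed values may differ from a published table in the last digit because of rounding.

```python
from math import sqrt


def alpha3(n, p):
    """Skewness of the point binomial: (q - p) / sqrt(npq)."""
    q = 1.0 - p
    return (q - p) / sqrt(n * p * q)


def alpha4(n, p):
    """Fourth standardized moment: 3 + 1/(npq) - 6/n (see Exercise 4)."""
    q = 1.0 - p
    return 3.0 + 1.0 / (n * p * q) - 6.0 / n


for n in range(100, 1100, 100):
    print(n, round(alpha3(n, 0.02), 3), round(alpha4(n, 0.02), 3))
```

Even with p as small as .02, the fourth moment is within about .05 of its normal value by n = 1000.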


Values of $\alpha_3$ and $\alpha_4$ for $(.98 + .02)^n$

    n      α_3     α_4
   100     .68     3.45
   200     .48     3.23
   300     .40     3.15
   400     .34     3.11
   500     .31     3.09
   600     .28     3.075
   700     .26     3.06
   800     .24     3.06
   900     .23     3.05
  1000     .21     3.045

The rapidity with which $\alpha_3$ approaches
zero and $\alpha_4$ approaches 3 as n increases,
even for the case where p is extremely
small, is shown by the accompanying
table.

EXERCISES

1. Plot the histograms and the frequency polygons for the binomials
following. Find for each binomial the Mo, M, $\sigma$, and $\alpha_3$.

a.  ii  +  iy     b.  (f  +  i)^     c.  i^  +  ^y     d.  (f +  ?)^ 

2. By extending Table 88 show that:

$$\sum X(X-1)(X-2)(X-3) f(X) = n(n-1)(n-2)(n-3)p^4$$

3. Using the value of $\nu_4$, given on page 162, show that for the point
binomial:

$$\mu_4 = \nu_4 = npq[1 + 3pq(n-2)]$$

4. Show that $\alpha_4$ for the point binomial is given by:

$$\alpha_4 = 3 + \frac{1}{\sigma^2} - \frac{6}{n}$$

5. Show that

$$\frac{n!}{(np)!\,(nq)!}\,p^{np}q^{nq}$$

reduces to

$$\frac{1}{\sqrt{2\pi npq}} = \frac{1}{\sigma\sqrt{2\pi}}$$

when Stirling's formula is applied.

6. Show that if $np - q$ is an integer

$$P_{np-q} = P_{np+p}$$

Hint. (1) Let $np - q = k$, then $np + p = k + 1$, from which obtain
$(n-k)/(k+1) = q/p$. (2) Show that $P_{np+p} = P_{k+1} = \frac{n-k}{k+1}\cdot\frac{p}{q}\cdot P_k$.
(3) Combine the results of (1) and (2).

7. Show that as n becomes infinite, $\alpha_3$ approaches zero and $\alpha_4$ approaches 3.

8. Verify the values of the table for $(.9 + .1)^n$.

    n      α_3      α_4
   100    .2667    3.0511
   200    .1886    3.0256
  1000    .0843    3.0051

The following exercises are for students of the calculus.

9. Show that

$$\frac{d^i}{dx^i}(q + px)^n\Big|_{x=1}, \quad (i = 0, 1, 2, 3)$$

give the totals of columns 2, 3, 4, and 5 of Table 88.

10. Show that the moments of $(q+p)^n$ given by

$$\mu_i = \frac{d^i}{dx^i}\left(pe^{qx} + qe^{-px}\right)^n\Big|_{x=0}, \quad (i = 1, 2, 3, 4)$$

are the same as we have given in the text. This relationship was given
by Karl Pearson in Biometrika, Vol. XII, p. 270.

11. The moments of $(q+p)^n$ can be obtained from

$$\mu_{t+1} = pq\left[nt\,\mu_{t-1} - \frac{d\mu_t}{dq}\right]$$

recalling that $\mu_0 = 1$ and $\mu_1 = 0$. Use this relation¹ to establish the values
given in the text.


100.  THE  POINT  BINOMIAL  APPLIED  TO  FREQUENCY 

DISTRIBUTIONS 

It should be emphasized that the terms of (1) represent probabilities
and that their sum is unity. By Section 95 (p. 374), if the terms of
(1) are multiplied by some suitable number, the several terms will
then represent frequencies. Thus, if 10 coins are thrown 1,024 times,
the terms of the expansion

$$1{,}024\left(\tfrac12+\tfrac12\right)^{10} = 1 + 10 + 45 + \cdots + 252 + \cdots + 10 + 1$$

represent the expected number of times that we should obtain 0, 1, 2,
..., 5, ..., 9, 10 heads, that is,

$$\text{expected frequency of } X = (1024)\,{}_{10}C_X\left(\tfrac12\right)^{10-X}\left(\tfrac12\right)^{X}$$

An experiment in which 10 coins were thrown 1,024 times was
performed, and the actual results together with the theoretical or
expected results are shown in Table 89.

¹ See article by A. T. Craig, Bulletin of the American Mathematical Society,
Vol. 40, p. 262.

Table 89. Actual and Expected Results
in Tossing 10 Coins 1,024 Times

  Number of Heads     Actual     Expected
        X              f(X)        f(X)
        0                 2           1
        1                10          10
        2                40          45
        3               116         120
        4               205         210
        5               257         252
        6               216         210
        7               126         120
        8                42          45
        9                 8          10
       10                 2           1
      Total           1,024       1,024

Figure 54. Observed and expected frequency polygons; X = Number of Heads.

When the data of Table 89 are plotted as in Figure 54, and
the frequency polygons are drawn, the differences between the
observed and the expected frequencies are seen to be slight. These
differences may be the result of many causes, such as the lack of
homogeneity of the coins, the faulty methods of tossing them, and
what are usually known as variations due to chance.

The statistical constants for the observed and the theoretical
distributions of Table 89 are given in Table 90. The constants for
the theoretical distribution were computed by (3) and Exercise 4
on page 390, whereas those for the distribution of observed values
were computed by the methods of Section 44 (p. 164).


Table 90

            Distribution of        Distribution of
            Theoretical Values     Observed Values
  M              5.0000                 5.0283
  σ              1.581                  1.567
  α_3            0.0000               - 0.0499
  α_4            2.8000                 2.9246
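The constants of Table 90 can be recomputed from the frequencies of Table 89. The sketch below is an illustration and not the authors' worksheet method of Section 44; the function name `summary` is hypothetical, and the observed frequency for X = 2 is taken as 40, the value that makes the observed column total 1,024.

```python
from math import sqrt


def summary(freqs):
    """Return (M, sigma, alpha_3, alpha_4) for frequencies listed at X = 0, 1, 2, ..."""
    total = sum(freqs)
    mean = sum(x * f for x, f in enumerate(freqs)) / total

    def central(k):
        # k-th moment about the mean
        return sum((x - mean) ** k * f for x, f in enumerate(freqs)) / total

    variance = central(2)
    sigma = sqrt(variance)
    return mean, sigma, central(3) / sigma ** 3, central(4) / variance ** 2


observed = [2, 10, 40, 116, 205, 257, 216, 126, 42, 8, 2]
expected = [1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1]

print("expected:", summary(expected))
print("observed:", summary(observed))
```

For the expected column the results agree with (3) for n = 10, p = q = 1/2, namely M = 5 and $\sigma = \sqrt{2.5}$.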

In a similar manner any distribution of observed values can be
more or less approximately reproduced by multiplying the terms of
the expansion of $(q+p)^n$ by the total frequency N. If the
distribution is nearly symmetrical, we take $p = q = \tfrac12$ and n such a number
that the (n + 1) terms of the expansion when multiplied by N will
give (n + 1) theoretical frequencies.

Thus, let us consider the following distribution of the heights of
750 college men. Since distributions of heights of men are known
to be closely symmetrical, we choose $p = q = \tfrac12$. Also, since there
are 14 classes of heights ranging from 61 inches to 74 inches inclusive,
we choose n = 13. Hence the terms of the expansion $750(\tfrac12+\tfrac12)^{13}$
give 14 theoretical frequencies. The following table exhibits the
frequency distributions of theoretical and observed values. The
theoretical frequency, for a given X, is $750\,{}_{13}C_X\left(\tfrac12\right)^{13-X}\left(\tfrac12\right)^{X}$.

Exercise. Compute values of M, $\sigma$, $\alpha_3$, and $\alpha_4$ for the two distributions
of Table 91 and thus make a comparison of their moments.

Table 91. Observed and Binomial Frequencies
of the Heights of 750 College Men

  Height      X      Observed f(X)      Binomial
    61        0            2                0
    62        1            4                1
    63        2           10                7
    64        3           32               26
    65        4           63               66*
    66        5          103              118
    67        6          146              157
    68        7          143              157
    69        8          111              118
    70        9           75               66*
    71       10           35               26
    72       11           12                7
    73       12            3                1
    74       13            1                0
  Total                  750              750

* This value was 65.5.
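The binomial column of Table 91 follows from the expression $750\,{}_{13}C_X(\tfrac12)^{13}$. The short sketch below reproduces it; it is an illustration, not part of the original text, and the function name `binomial_frequencies` is hypothetical. The unrounded value at X = 4 and X = 9 is about 65.5, which the table shows as 66 with a footnote.

```python
from math import comb


def binomial_frequencies(total, n):
    """Theoretical frequencies total * nCx * (1/2)^n for x = 0, ..., n (p = q = 1/2)."""
    return [total * comb(n, x) / 2 ** n for x in range(n + 1)]


theory = binomial_frequencies(750, 13)

# Heights 61 to 74 inches correspond to X = 0 to 13.
for height, value in zip(range(61, 75), theory):
    print(height, round(value, 1))
```

The same function, with a different total and n, serves for graduating any nearly symmetrical distribution in the manner described above.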


Comparing the observed with the theoretical frequency it is of
course noted that, for a given value of X, the observed frequency
differs from the theoretical frequency. Even the most scrupulous
among us are not surprised at these differences. However, the student
may properly inquire as to just how large such differences may be.
This is one of the fundamental questions to which we shall give
attention in Chapter 13 when we consider the problem of sampling.

For a given N and n, the theoretical distribution $N(q+p)^n$
obviously depends upon the value of p or q. The value of p may be
determined a priori as in dice-throwing or coin-tossing experiments,
or it may be determined empirically from experiment or observation
as in the probabilities of life and death. When p is determined
empirically, it is influenced by sampling errors. Other samples of the
same size chosen from the same universe will not yield the same
values of p, and consequently the goodness of the theoretical
distribution $N(q+p)^n$ for graduation purposes will depend upon the
accuracy of p.

The binomial distribution $(q+p)^n$ was the first theoretical
distribution to be established. It was first discussed in Ars Conjectandi
(published posthumously in 1713) by James Bernoulli and thus any
discrete distribution with frequencies proportional to the terms of
the expansion is frequently called a Bernoulli Distribution. In fact,
what we have called the Repeated Trials Theorem is frequently called
The Bernoulli Theorem.

EXERCISES 

1. Table A below gives the I.Q.'s of 905 school children. Table B
gives the weights of 1000 school children. Graduate Table A by the
expansion $905(\tfrac12+\tfrac12)^{8}$ and Table B by $1000(\tfrac12+\tfrac12)^{9}$.

Table A

     X       f(X)
    60.5       3
    70.5      21
    80.5      78
    90.5     182
   100.5     305
   110.5     209
   120.5      81
   130.5      21
   140.5       5
   Total     905

M = 100.95
σ = 13.0

Table B

     X       f(X)
    29.5       1
    33.5      14
    37.5      56
    41.5     172
    45.5     245
    49.5     263
    53.5     156
    57.5      67
    61.5      23
    65.5       3
   Total    1000

M = 47.71 pounds
σ = 5.88 pounds

101.  THE  NORMAL  CURVE:  INTRODUCTORY  REMARKS 

In preceding chapters we have described frequency distributions
by three methods: the graphical method, the method of moments,
and the point binomial. The graphical method is a mere pictorial
representation of the tabulated data and is inadequate statistically
because it is only a picture. The method of moments is a refined
method which is adequate for many purposes, especially for purposes
of comparison, when M, $\sigma$, $\alpha_3$, and $\alpha_4$ are computed. The binomial
distribution is still a step forward. It gives us an equation for writing
down the theoretical frequency for a given integral value of X,
and the estimated sum of such frequencies between certain specified
limits. Thus, theoretically at least, the point binomial provides all
the advantages that accrue from an equation.

Practically, the point binomial is unsatisfactory for two important
reasons. First, it is a discontinuous function, being strictly defined
only for integral values of X. Second, when n is large, its use in
answering many questions entails so much labor as to render it
unfit for practical usage. We seek, therefore, a continuous function
having approximately the same ordinates as the binomial series and
which is so well tabulated that important questions in probability
can be answered by its use without the tedium of undue labor. The
simplest continuous function that meets our needs is the normal or
Gaussian function, whose general equation is:

$$y = Ce^{-h^2x^2}$$

Here e is the base of the natural or Napierian system of logarithms
whose value is 2.71828.... The constant C determines the
maximum height of the curve and the constant h its spread.

As was stated in Section 89, the normal or Gaussian curve was
first established by De Moivre. A proof was also given by Laplace
at a later date and hence the curve is sometimes called the Laplacean
curve. Gauss approved the law, used it, and gave an original proof
of it. Thus, the normal law began its early life with a rare hereditary
background. No wonder the lesser lights of the first half of the
nineteenth century claimed for it a value that was undeserved,
considered it to be "the ideal curve," and demanded an explanation if a
distribution did not obey it.

The writers in the latter half of the nineteenth century seem to
have been more careful that their enthusiasms did not outrun the
facts, for as data from many fields accumulated it became general
knowledge that the normal curve is but one of a number of types
of curves which are used to describe frequency distributions. So
we must not assume that a non-normal distribution is "abnormal"
in the usual sense of the word.

The normal curve, however, is by far the most important type;
further, its importance seems to have increased within recent years,
and the history of the theory of statistics may date from its discovery
by De Moivre in 1733. There are good reasons why this is so.

First,  it  is  a  continuous  function. 

Second, the normal curve lends itself well to mathematical treatment.
That is, it possesses properties that are mathematically elegant,
comparatively simple to derive, and expressible in simple forms.

Third, a large number of distributions, mound-shaped in appearance, are
approximately of the normal form and may be subjected to normal curve
analysis as a first approximation.

Fourth, many sampling distributions, such as distributions of means,
distributions of standard deviations, and others are of the normal form
exactly or to a satisfactory degree of approximation. Thus, the formulas
for determining the reliability of a statistical function "lean heavily upon
this law."

Fifth, of two well-known systems of generalized frequency curves, one
of them, that developed by Gram, Thiele, Charlier (known as the
Scandinavian school), is based upon the normal curve as a generating function.

A development of the theory of generalized frequency functions,
though an important and attractive study, is so severe in the
mathematical background required to comprehend it that its inclusion
in our elementary study would seem inappropriate. However, a
derivation of the normal curve and a study of its properties are so
essential to the study of elementary statistical analysis that their
inclusion in our text seems mandatory.

102.   DERIVATION  OF  THE  EQUATION 
TO  THE  NORMAL  CURVE 

Figure 51 (p. 383) shows the frequency polygon for the point
binomial $(\tfrac12+\tfrac12)^{10}$. The eleven points are symmetrically distributed
about the vertical line through X = M = np = 5. In like manner if
$(\tfrac12+\tfrac12)^{n}$ be plotted for any n, the points will be symmetrically
distributed about the vertical line through X = M = np = n/2 since
p = q and $\alpha_3 = 0$. Now if n be allowed to increase indefinitely the
polygon of (n + 1) vertices and (n + 2) sides will approach a smooth
curve,¹ the normal curve, symmetrical to the vertical line through
X = M. In other words, the normal curve is the limit of the point
binomial $(\tfrac12+\tfrac12)^{n}$ as n becomes infinite.

¹ As n increases, it becomes necessary to reduce the X-scale to keep the
diagram within reasonable dimensions. We are interested in confining the range
to an interval of three or four standard deviations from the mean. Consequently,
we assume that n increases and $\Delta x$ decreases in such a way that $n(\Delta x)^2$ always
equals a constant $2\sigma^2$.

The proof of the statement above is facilitated by assuming that
n is even and by employing the

Lemma. If the several terms of the expansion $(\tfrac12+\tfrac12)^{2n}$ be plotted
as ordinates at intervals of $\Delta X$ along the X-axis, ${}_{2n}C_0/2^{2n}$ being taken
at the origin, so that the abscissas of ${}_{2n}C_1/2^{2n}$, ${}_{2n}C_2/2^{2n}$, ..., ${}_{2n}C_n/2^{2n}$,
..., ${}_{2n}C_{2n}/2^{2n}$ are $\Delta X$, $2\Delta X$, ..., $n\Delta X$, ..., $2n\Delta X$, then:

$$M = n\,\Delta X \quad \text{and} \quad \sigma^2 = \frac{n}{2}(\Delta X)^2$$

The proof of this lemma is identical in method to that used in
Section 99B (p. 387), hence its derivation will be left as an exercise
for the student.

Let us consider then the expansion:

$$\left(\tfrac12+\tfrac12\right)^{2n} = \frac{1}{2^{2n}}\left[1 + {}_{2n}C_1 + {}_{2n}C_2 + \cdots + {}_{2n}C_n + \cdots + {}_{2n}C_{n+r} + \cdots + 1\right]$$

Let us plot the terms of this expansion as ordinates at equal
intervals $\Delta X$ along the X-axis beginning with the first term at the
origin. The maximum term is evidently ${}_{2n}C_n/2^{2n}$, which we erect at the
mean, O'. We plot the other terms with respect to this new origin.
Evidently $\Delta x = \Delta X$. Let P(x, y) and Q(x + $\Delta x$, y + $\Delta y$) be the
successive vertices of the polygon which are determined by the rth
and the (r + 1)th terms from the middle term of the above expansion.
Then the ordinates of the points are:

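$$y = \frac{{}_{2n}C_{n+r}}{2^{2n}}, \qquad y + \Delta y = \frac{{}_{2n}C_{n+r+1}}{2^{2n}}$$

Since

$${}_{2n}C_{n+r+1} = {}_{2n}C_{n+r}\left(\frac{n-r}{n+r+1}\right) \quad \text{(see Exercise 13 on p. 374)}$$

we have:

$$\frac{\Delta y}{\Delta x} = -\frac{y}{\Delta x}\left(\frac{2r+1}{n+r+1}\right)$$

The abscissa of P is $x = r\,\Delta x$; hence:

$$r = \frac{x}{\Delta x}$$

Consequently:

$$\frac{\Delta y}{\Delta x} = -y\left(\frac{2x + \Delta x}{n\,\Delta x^2 + x\,\Delta x + \Delta x^2}\right)$$

From the lemma above we have:

$$n\,\Delta x^2 = 2\sigma^2, \text{ a constant}$$

Therefore:

$$\frac{\Delta y}{\Delta x} = -y\left(\frac{2x + \Delta x}{2\sigma^2 + x\,\Delta x + \Delta x^2}\right)$$

Now let n become infinite and $\Delta x$ approach zero.* We then have

$$\frac{dy}{dx} = -\frac{xy}{\sigma^2}$$

which, upon integration, reduces to:

$$y = Ce^{-x^2/(2\sigma^2)} = Ce^{-h^2x^2}$$

where $h = \dfrac{1}{\sigma\sqrt{2}}$ is called the index of precision.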
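The limiting process can be watched numerically. With $x = r\,\Delta x$ and $n\,\Delta x^2 = 2\sigma^2$, the exponent $x^2/(2\sigma^2)$ equals $r^2/n$, so the ratio of the rth ordinate from the middle to the middle ordinate should approach $e^{-r^2/n}$. The sketch below is an illustration under those assumptions and is not part of the original text; the function names are hypothetical.

```python
from math import comb, exp


def ratio_binomial(n, r):
    """Ordinate r steps from the middle of (1/2 + 1/2)^(2n), divided by the middle ordinate."""
    return comb(2 * n, n + r) / comb(2 * n, n)


def ratio_normal(n, r):
    """Corresponding ratio on the limiting curve: exp(-x^2 / (2 sigma^2)) = exp(-r^2 / n)."""
    return exp(-r * r / n)


# Hold r^2 / n fixed at 1/2 (a deviation of one standard deviation) while n grows.
for n, r in [(2, 1), (8, 2), (50, 5), (5000, 50)]:
    print(n, r, round(ratio_binomial(n, r), 5), round(ratio_normal(n, r), 5))
```

The two columns draw together as n increases, which is the content of the statement that the normal curve is the limit of the symmetrical point binomial.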


* See footnote, page 397.

In order to make this curve statistically useful, we shall assume
that the area under the curve is equal to the area of the histogram,
Nw, where w is the class width and N is the total frequency. That is,
we assume

$$\int_{-\infty}^{\infty} y\,dx = Nw$$

from which it follows, using the well-known relation [Ex. 4, p. 404]

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi},$$

that

$$C = \frac{Nw}{\sigma\sqrt{2\pi}}$$

Substituting this value, we have the equation to the normal frequency
curve:

$$y = \frac{Nw}{\sigma\sqrt{2\pi}}\,e^{-x^2/(2\sigma^2)} \qquad (4)$$

It must be emphasized that in equation (4) x is the deviation from
the mean that corresponds to the frequency y, or f(x). By replacing x
by its equal X - M we may express the equation in the form:

$$Y = \frac{Nw}{\sigma\sqrt{2\pi}}\,e^{-(X-M)^2/(2\sigma^2)} \qquad (5)$$

If in (4) we make the area under the curve equal to unity, the
equation reduces to the normal probability curve:

$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/(2\sigma^2)} \qquad (6)$$

which gives the probability of any deviation x.

It is customary, because of the simplicity of application, to express the
deviations in standard units, that is, to make $\sigma$ the unit for measuring
deviations. If in (4) and (5) we place

$$t = \frac{x}{\sigma} = \frac{X - M}{\sigma}$$

we obtain:

$$y = \frac{Nw}{\sigma\sqrt{2\pi}}\,e^{-t^2/2} \qquad (7)$$

Finally we write:

$$y = \frac{Nw}{\sigma_x}\,\phi(t) \qquad (8)$$

where

$$\phi(t) = \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2} \qquad (9)$$

(Read: phi of tee.)

103. SOME PROPERTIES OF $\phi(t)$*

Values of $\phi(t)$ and of the areas bounded by $\phi(t)$, the t-axis, and
certain ordinates are tabulated in Appendix B. The graph of $\phi(t)$
is shown in the accompanying figure which is drawn from the values
in Table 92.

Figure 56. Graph of $\phi(t)$, with the horizontal axis marked in three
equivalent scales: t = -3, ..., 3; x = -3σ, ..., 3σ; X = M - 3σ, ..., M + 3σ.

Since $-t$ yields the same value to $\phi(t)$ as $+t$, that is, since
$\phi(-t) = \phi(t)$, the curve is symmetrical with respect to the vertical
line through t = 0. It is therefore not necessary to tabulate negative
values of t. Since the total area under $\phi(t)$ is 1.0000, the area on
either side of the vertical line of symmetry is 0.5000. Therefore the
median coincides with the mean. The largest value of $\phi(t)$ is that for
which t = 0, therefore the mode coincides with the mean. There is no
finite value of t for which $\phi(t) = 0$, but $\phi(t)$ is relatively small for
values of t outside of t = ±3. It is because of the last-mentioned fact
that the normal curve can be used to represent finite distributions. As
a matter of fact the combined area of the two tails beyond t = -3 and
t = +3 is only 0.0026, and the combined area of the two tails beyond
t = -4 and t = +4 is 0.000,064. The curve crosses its tangent at
t = ±1, $\phi(t) = .2420$. These are called inflection points.

Table 92

    t      φ(t)
   0       .3989
   0.5     .3521
   1.0     .2420
   1.5     .1295
   2.0     .0540
   2.5     .0175
   3.0     .0044

* Several of these properties require the calculus for proofs.

The areas of certain portions of $\phi(t)$ are so important in statistical
analysis that we must not fail to emphasize them. We shall use the
symbol $A_\phi\big]_a^b$ or, more briefly, $A\big]_a^b$ to mean "the area under $\phi(t)$
from t = a to t = b." Thus, we have from the table $A\big]_0^1 = .3413$,
$A\big]_0^2 = .4773$, $A\big]_0^3 = .4987$. By the simple addition and
subtraction of areas we also have

$A\big]_1^2 = .1360$, $A\big]_2^3 = .0214$, $A\big]_{-1}^{0} = .3413$, $A\big]_{-1}^{2} = .8186$.

The statement $A\big]_0^1 = .3413$ means that between the ordinates
erected at t = 0 and t = 1 is included 34.13 per cent of the total
area under the curve. More broadly interpreted, it means that for
a normal frequency distribution about one-third of the total frequency
is found between the mean and x = σ (see p. 135). In the language
of probability, the statement means that the chance is approximately
1/3 that a measure selected at random from a given distribution of
variates normally distributed will fall within the interval between
t = 0 and t = 1, or between x = 0 and x = σ, or between X = M
and X = M + σ.
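Areas under $\phi(t)$ need not be read from a printed table; they can be computed from the error function, since the area from 0 to t equals $\tfrac12\,\mathrm{erf}(t/\sqrt{2})$. The following sketch is an illustration, not part of the original text; the function name `area` is hypothetical, and the results agree with the four-place tabulated values only to within rounding in the last digit.

```python
from math import erf, sqrt


def area(a, b):
    """Area under phi(t) from t = a to t = b, using the standard library erf."""
    return 0.5 * (erf(b / sqrt(2.0)) - erf(a / sqrt(2.0)))


print(round(area(0, 1), 4))    # between the mean and one standard deviation
print(round(area(0, 2), 4))
print(round(area(0, 3), 4))
print(round(area(1, 2), 4))    # obtained in the text by subtraction
print(round(area(-1, 2), 4))   # obtained in the text by addition
```

Areas with a negative lower limit follow from the symmetry of the curve, which the function handles automatically because erf is an odd function.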

It  will  be  left  as  exercises  for  the  student  to  interpret  the  other 
areas  illustrated  above. 
The value of t that satisfies one of the equations

$$A_\phi]_0^{t} = .2500 \qquad A_\phi]_{-t}^{t} = .5000 \tag{10}$$

defines one of the most important concepts found in statistics. The
value of t defined by either of the given equations (10) is called the
probable error of a single observation. The probable error, E,
is that distance which, when laid off on either side of the mean of a
normal curve, defines an interval such that, if ordinates are erected
at its end points, the area included by the ordinates, the curve, and
the base line is one-half the total area under the curve. Stated somewhat
differently, the probable error of a distribution of variates
normally distributed may be defined as that deviation on either side
of the mean within which exactly half the variates lie. Since half the
total frequency lies within the interval M - E to M + E, if any
one variate be selected at random from the given variates there
is an even chance that the selected variate falls within the given
interval M - E to M + E or without it.

For an approximate solution of equation (10) let us interpolate between
t = .67 and t = .68. The solution is:

$$A_\phi]_0^{.67} = .2486, \qquad A_\phi]_0^{t} = .2500, \qquad A_\phi]_0^{.68} = .2518$$

$$\frac{z}{.01} = \frac{.0014}{.0032}, \qquad z = .0044$$

and t = .67 + z = .6744. More extended tables lead to the more
accurate value

$$t = \frac{x}{\sigma} = .6745 \text{ (approximately)}$$

and therefore $x = .6745\sigma$; that is:

$$E_x = .6745\,\sigma_x \tag{11}$$

If  a  distribution  is  not  normal,  its  probable  error  is  estimated  by 
equation  (11). 
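The interpolation for the probable error can be reproduced, and the more accurate value .6745 confirmed, with a short calculation. This is a sketch only: the bisection routine and the names `area_0_to` and `probable_error_t` are illustrative assumptions, not part of the text.

```python
import math

def area_0_to(t: float) -> float:
    """Area under phi(t) between 0 and t."""
    return 0.5 * math.erf(t / math.sqrt(2.0))

def probable_error_t(target: float = 0.25, lo: float = 0.0,
                     hi: float = 5.0, tol: float = 1e-10) -> float:
    """Solve area_0_to(t) = target by bisection (the area increases with t)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if area_0_to(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Linear interpolation between the tabled values quoted in the text.
z = 0.01 * (0.2500 - 0.2486) / (0.2518 - 0.2486)
t_interpolated = 0.67 + z

t_solved = probable_error_t()
print(f"interpolated t = {t_interpolated:.4f}")
print(f"solved t       = {t_solved:.4f}")
```

For a distribution with standard deviation sigma, the probable error is then `t_solved * sigma`, in agreement with equation (11).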

Figure  57  will  assist  in  clarifying  the  concept  of  probable  error. 
The  values  of  the  moments  of  the  normal  curve  are  given  in 
Exercise  8,  page  405. 
[Figure 57: normal curve with the base line scaled in units of the probable error E.]


EXERCISES 

1. Find the portions of the area under $\phi(t)$ indicated, and draw a figure
in each case.

a.  A<^]_^  d.  A<^]2  g.  ^</»Dll 

e.  A<^]_2l         h.  A^^l.n 


b.  ^'^ 


2.4 
2.389 


-  2.746 
3.468 


C.  ^<^J_24 

2.  Find  t  in  the  following  equations: 

b.  =  .4844  d.  =  .4878 

3.  Verify  the  percentages  of  Figure  57,  in  which  E  is  taken  as  the  x-unit. 
The  following  exercises  are  for  students  of  calculus. 


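4. Prove:

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$$

Hint: Let

$$I = \int_{-\infty}^{\infty} e^{-x^2}\,dx = \int_{-\infty}^{\infty} e^{-y^2}\,dy$$

or

$$I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)}\,dx\,dy$$

which is the volume under the surface $z = e^{-(x^2+y^2)}$. Change to polar
coordinates. Then

$$I^2 = 4\int_0^{\pi/2}\int_0^{\infty} e^{-r^2}\,r\,dr\,d\theta = \pi.$$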

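5. Show that y in (4) has a maximum at x = 0.

6. Show that y in (4) has inflection points at x = ±σ.

7. Consider equation (4). Show that the mean deviation about the
mean

$$= \frac{1}{Nw}\int_{-\infty}^{\infty} |x|\,y\,dx = \frac{2}{Nw}\int_0^{\infty} x\,y\,dx = \sqrt{\frac{2}{\pi}}\,\sigma = 0.79788\cdots\sigma.$$

8. Evaluate the moments of the normal curve (4), where

$$\mu_k = \frac{1}{Nw}\int_{-\infty}^{\infty} x^k\,y\,dx.$$

That is, show that

$$\mu_0 = 1, \quad \mu_1 = 0, \quad \mu_2 = \sigma^2, \quad \mu_3 = 0, \quad \mu_4 = 3\mu_2^2 = 3\sigma^4$$

$$\alpha_0 = 1, \quad \alpha_1 = 0, \quad \alpha_2 = 1, \quad \alpha_3 = 0, \quad \alpha_4 = 3$$

$$\alpha_{2n} = 1\cdot 3\cdot 5\cdots(2n-1) = \frac{(2n)!}{2^n\,(n!)}, \qquad \alpha_{2n+1} = 0$$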

9.  Show  that  for  the  normal  curve 

Mean  Deviation  about  M  =  1.183  Probable  Error 
Probable  Error  =  0.8454  Mean  Deviation 


104.   ILLUSTRATIVE  EXAMPLES 

Example 1. Given a normal distribution with N = 1,000, w = 2,
M = 16, and σ = 4: a. How many variates fall between X = 12 and
X = 20? b. How many lie above X = 26? c. How many lie below
X = 10?

[Figure 58: normal curve with the X, x, and t scales marked on the base line.]

Figure 58 shows a normal curve with M = 16, σ = 4, area = (1000)2.
Since our tables are expressed for values of t, we must transform our data
into t units. We have shown three scales on the base line. If X = 12,
x = X - M = 12 - 16 = -4 and t = x/σ = -4/4 = -1. Similarly,
if X = 20, t = 1; if X = 26, t = 2.5, and if X = 10, t = -1.5.

a. Now $A_\phi]_{-1}^{1} = .6826$

This means that 68.26 per cent of the total area under the curve lies between
t = -1 and t = 1, or between X = 12 and X = 20. By means of the
calculus it can be shown that the area under Y from $X_1$ to $X_2$, or under
y from $x_1$ to $x_2$, is Nw times the area under $\phi(t)$ between $t_1$ and $t_2$; that is:

$$A_Y]_{X_1}^{X_2} = A_y]_{x_1}^{x_2} = Nw \cdot A_\phi]_{t_1}^{t_2}$$

See equations (4), (5), and (8). Therefore:

$$A_Y]_{X=12}^{X=20} = .6826(1000)2 = (682.6)2$$

Since

2,000 units of area represent 1,000 variates,
(682.6)2 units of area represent 682.6 variates.

That is, 682.6 variates fall between X = 12 and X = 20.

In short, since $A_\phi]_{-1}^{1} = .6826$
we may say that 68.26 per cent of N or

.6826(1,000) = 682.6

variates fall between X = 12 and X = 20.

b. Similarly, since $A_\phi]_{2.5}^{\infty} = .5000 - .4938 = .0062$,

.0062(1,000) = 6.2

variates are beyond X = 26.

c. Since $A_\phi]_{-\infty}^{-1.5} = .5000 - .4332 = .0668$,

.0668(1,000) = 66.8

variates are below X = 10.
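The three counts in Example 1 can be verified by converting each X to t units and taking the corresponding area. The sketch below assumes the error function in place of the printed table; the helper names are illustrative.

```python
import math

def cdf(t: float) -> float:
    """Area under phi(t) from minus infinity to t."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Constants of Example 1.
N, M, sigma = 1000, 16.0, 4.0

def to_t(X: float) -> float:
    """Convert a value on the X scale to t units."""
    return (X - M) / sigma

between_12_and_20 = N * (cdf(to_t(20)) - cdf(to_t(12)))
above_26 = N * (1.0 - cdf(to_t(26)))
below_10 = N * cdf(to_t(10))

print(f"between X = 12 and X = 20: {between_12_and_20:.1f}")
print(f"above X = 26:              {above_26:.1f}")
print(f"below X = 10:              {below_10:.1f}")
```

The class interval w does not enter the counts, since the factor Nw in the area cancels against the w units of area that represent each variate.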

Example 2. For the distribution described in Example 1, compute Y
when X = 4, 8, 12, 16, 20, 24, 28.

Using (5), the equation of the curve is:

$$Y = \frac{(1000)2}{4\sqrt{2\pi}}\; e^{-\frac{(X-16)^2}{32}}$$

Let:

$$t = \frac{x}{\sigma} = \frac{X - M}{\sigma} = \frac{X - 16}{4}$$

Then:

$$Y = \frac{2000}{4}\cdot\frac{1}{\sqrt{2\pi}}\; e^{-\frac{t^2}{2}} = 500\,\phi(t)$$

Recalling that $\phi(-t) = \phi(t)$, we have the following table of values.

     X      t     φ(t)       Y
     4     -3    .0044      2.2
     8     -2    .0540     27.0
    12     -1    .2420    121.0
    16      0    .3989    199.4
    20      1    .2420    121.0
    24      2    .0540     27.0
    28      3    .0044      2.2
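The table of Example 2 follows directly from Y = 500 φ(t). A minimal sketch, with `phi` written out from its definition (the function name is an assumption of this sketch):

```python
import math

def phi(t: float) -> float:
    """Ordinate of the standard normal curve at t."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

# Constants of Example 1.
N, w, M, sigma = 1000, 2.0, 16.0, 4.0
scale = N * w / sigma          # the factor 500 in Y = 500 * phi(t)

ordinates = {}
for X in (4, 8, 12, 16, 20, 24, 28):
    t = (X - M) / sigma
    ordinates[X] = scale * phi(t)
    print(f"X = {X:2d}   t = {t:+.0f}   phi(t) = {phi(t):.4f}   Y = {ordinates[X]:.1f}")
```

Small differences in the last printed digit can arise because the text multiplies the four-place tabled ordinate by 500, whereas the sketch carries full precision.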

Example 3. If 10 coins are thrown, use the normal probability function
to find the approximate probability of obtaining exactly 7 heads.

The various probabilities are given by the terms of $(\tfrac12 + \tfrac12)^{10}$.

The exact probability of obtaining 7 heads is given by:

$$P_7 = {}_{10}C_7\,(\tfrac12)^7(\tfrac12)^3 = .117$$

We may apply the normal curve to obtain an approximate value of $P_7$. We
have:

$$M = np = 10(\tfrac12) = 5$$
$$\sigma = \sqrt{npq} = \sqrt{10(\tfrac12)(\tfrac12)} = 1.581$$

$$y = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{x^2}{2\sigma^2}} = \frac{1}{\sigma}\,\phi(t)$$

gives the probability of any deviation x.

We seek y for X = 7. But if X = 7, x = X - M = 7 - 5 = 2 and

$$t = \frac{x}{\sigma} = \frac{2}{1.581} = 1.265.$$

Since $\phi(1.265) = .1792$, we have therefore

$$y = \frac{1}{1.581}(.1792) = .113$$

The slight discrepancy in the two results is an evidence that the point
of the given binomial is near the normal curve.
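The comparison in Example 3 between the exact binomial term and the normal ordinate can be sketched as follows. The variable names are illustrative; `math.comb` supplies the binomial coefficient.

```python
import math

def phi(t: float) -> float:
    """Ordinate of the standard normal curve at t."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

n, p, heads = 10, 0.5, 7
q = 1.0 - p

# Exact term of the point binomial.
exact = math.comb(n, heads) * p**heads * q**(n - heads)

# Normal approximation: y = phi(t) / sigma, with t = (X - M) / sigma.
M = n * p
sigma = math.sqrt(n * p * q)
t = (heads - M) / sigma
approx = phi(t) / sigma

print(f"exact  P(7 heads) = {exact:.3f}")
print(f"normal ordinate   = {approx:.3f}   (t = {t:.3f})")
```

The ordinate serves as a probability here because the class interval of the binomial is one unit, so the rectangle of the histogram over X = 7 has area equal to its height.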
Example 4. Given a normal distribution with M = 75 and σ = 8,
what limits will include the middle 75 per cent of the total frequency?

We must solve the equation:

$$A_y]_{-x}^{x} = .75\,Nw$$

or the equation

$$A_\phi]_{-t}^{t} = .75$$

Since

$$A_\phi]_{-t}^{t} = 2A_\phi]_0^{t} = .75,$$

we have:

$$A_\phi]_0^{t} = .375$$

From the tables

$$t = \frac{x}{\sigma} = 1.15$$

and therefore:

$$x = 1.15\sigma = (1.15)8 = 9.20$$

Hence the limits are M ± x = 75 ± 9.20 = 65.80 and 84.20.

In approximating a sum of the successive terms of the point binomial
by the normal curve, we must find the area under the appropriate
part of the curve. The sum of the successive terms of the
binomial equals the sum of the areas of the corresponding rectangles
of the histogram. We must then replace the rectangles of the histogram
by corresponding areas of the curve and this requires that we
use whole rectangles, not half rectangles at the ends.

It is evident that the normal curve will give a close approximation
to the sum of the terms of a binomial only when p and q are nearly
equal, and n is fairly large. Certainly if there is considerable skewness,
the approximation by the normal curve may not be satisfactory,
especially near the ends of the distribution. We cannot make definite
statements as to when the normal curve may be used as an approximation
to the binomial. Whether the approximation is satisfactory
or not depends upon the accuracy of the results desired and
how the approximation is to be used.

Exercise 6. If 10 coins are tossed, what is the probability of getting
4, 5, 6 or 7 heads? (a) Use the theorem of repeated trials for an accurate
result correct to two decimals, and (b) use the normal curve to find an
approximate result.

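Solution to (a). By the theorem of repeated trials the required probability
is the sum $\sum_{x=4}^{7} {}_{10}C_x\,(\tfrac12)^x(\tfrac12)^{10-x}$. This sum is

$$P = \frac{210 + 252 + 210 + 120}{1024} = \frac{792}{1024} = .77$$

Solution to (b).

$$n = 10, \qquad p = q = \tfrac12$$
$$M = np = 5$$
$$\sigma = \sqrt{npq} = 1.58$$
$$x_1 = 3.5 - 5 = -1.5$$
$$x_2 = 7.5 - 5 = 2.5$$
$$t_1 = \frac{x_1}{\sigma} = \frac{-1.5}{1.58} = -.95$$
$$t_2 = \frac{x_2}{\sigma} = \frac{2.5}{1.58} = 1.58$$

[Figure: histogram rectangles for X = 4, 5, 6, 7 under the normal curve, with X and t scales on the base line.]

$$\text{Approximate } P = A_\phi]_{-.95}^{1.58} = .3289 + .4430 = .7719.$$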
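The use of whole rectangles in Exercise 6 amounts to extending the limits half a unit beyond the first and last terms (3.5 and 7.5 rather than 4 and 7). The following sketch, with illustrative names, compares the exact sum with the area so obtained.

```python
import math

def cdf(t: float) -> float:
    """Area under phi(t) from minus infinity to t."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

n, p = 10, 0.5
q = 1.0 - p

# (a) Exact sum of the binomial terms for 4, 5, 6, and 7 heads.
exact = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(4, 8))

# (b) Normal approximation with whole rectangles: limits 3.5 and 7.5.
M = n * p
sigma = math.sqrt(n * p * q)
t1 = (3.5 - M) / sigma
t2 = (7.5 - M) / sigma
approx = cdf(t2) - cdf(t1)

print(f"exact  P = {exact:.4f}")
print(f"approx P = {approx:.4f}   (t1 = {t1:.2f}, t2 = {t2:.2f})")
```

The approximate value may differ slightly in the fourth decimal from .7719, since the text rounds t to two decimals before entering the table.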


105.   ON  THE  SIGNIFICANCE  OF  RESULTS 

It has been observed that for a normal or a moderately skewed,
mound-shaped distribution the total range seldom exceeds six times
the standard deviation. If, then, a distribution is approximately
normal, it is not expected that a measure chosen at random will show
a variation of more than three times the standard deviation, on either
side, from the mean. A divergence of more than ±3σ (about
±4.5E) may be called significant; that is, other forces than mere
chance have most probably operated to bring about abnormal results.
Thus if 400 coins are tossed (or if one coin is tossed 400 times)
what is the allowable variation in the number of heads? We have:

$$M = np = 400(\tfrac12) = 200 = \text{the expected number of heads}$$

and

$$\sigma = \sqrt{npq} = \sqrt{400(\tfrac12)(\tfrac12)} = 10$$

$$3\sigma = 30$$

It is very improbable then that fewer than 170 (= 200 - 30) or
more than 230 (= 200 + 30) heads will appear. In fact we can measure
the probability in question. Since $A_\phi]_{-3}^{3} = .9974$, if 400 coins
are tossed, the probability of obtaining between 170 and 230 heads
is 9,974/10,000. That is, the probability of obtaining more than
230 or fewer than 170 heads is 26/10,000. In other words, the odds
in favor of obtaining between 200 ± 30 heads are 9,974 to 26 or
383.6 to 1.

In general, we may state that the probability of a measure's lying
within the range M ± 3σ or M ± 4.5E is 9,974/10,000 and that
the odds favoring a measure's lying within this range are nearly
385 to 1.
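The 3σ limits for the 400-coin illustration, and the probability attached to them, can be sketched as below. Two points are assumptions of this sketch rather than of the text: the error function replaces the four-place table, and, as in the text, no allowance is made for the discreteness of the count. Direct evaluation gives an area of about .9973 rather than the tabled .9974, so the computed odds come out somewhat below 383.6 to 1.

```python
import math

def central_area(t: float) -> float:
    """Area under phi(t) between -t and +t."""
    return math.erf(t / math.sqrt(2.0))

n, p = 400, 0.5
M = n * p                              # expected number of heads
sigma = math.sqrt(n * p * (1.0 - p))   # standard deviation of the count
lower, upper = M - 3.0 * sigma, M + 3.0 * sigma

p_within = central_area(3.0)
odds = p_within / (1.0 - p_within)

print(f"M = {M:.0f}, sigma = {sigma:.0f}, limits = {lower:.0f} to {upper:.0f}")
print(f"probability within limits = {p_within:.4f}")
print(f"odds in favor             = {odds:.1f} to 1")
```

The sensitivity of the odds to the fourth decimal of the area is a reminder that such odds are best quoted in round numbers.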

Another type of language has become fashionable when speaking
of certain t or x values in connection with the normal curve. It is
seen from our tables that $A_\phi]_{-1.96}^{1.96} = .95$ and thus 5 per cent of the
area lies outside the limits t = ±1.96 or x = ±1.96σ. Consequently,
there is 1 chance in 20 that x may lie outside ±1.96σ.
This value 1.96σ is called the 5 per cent level of significance. Similarly,
$A_\phi]_{-2.576}^{2.576} = .99$ and thus 1 per cent of the area lies outside the
limits t = ±2.576 or x = ±2.576σ. Consequently, there is 1 chance
in 100 that x may lie outside ±2.576σ. This value 2.576σ is called
the 1 per cent level of significance. These values may be called
confidence limits, the probability giving a measure of confidence that
an item falls within the stated limits.
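The per cent level attached to any deviation t is the area lying outside ±t. A minimal sketch, assuming the error function in place of the table; the function name is illustrative.

```python
import math

def level_of_significance(t: float) -> float:
    """Fraction of the area under phi(t) lying outside the limits -t and +t."""
    return 1.0 - math.erf(abs(t) / math.sqrt(2.0))

levels = {t: level_of_significance(t) for t in (1.96, 2.576)}
for t, tail in levels.items():
    print(f"t = ±{t}: {100.0 * tail:.2f} per cent of the area lies outside")
```

To state the level for a deviation x on the original scale, divide x by σ first and pass the quotient to the function.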

The question, "At what probability level does a deviation become
significant?" is one that cannot be answered with scrupulous exactness.
Statisticians differ in their credulity. Any level that is set is
arbitrary. Conceivably, a deviation x may be any amount. However,
the occurrence of the deviation may be so unlikely that it can
hardly be looked upon as due to chance. Some authorities state
that if x is outside the 5 per cent level it is significant; if it is outside
the 1 per cent level, it is highly significant. A safe procedure for the
student is that he be prepared to state in terms of probability, or
as a percentage, the level of significance for any deviation.

Questions. What are the values of t and x for the 10 per cent level
of significance?

What are the values of t and x for the 25 per cent level of significance?
What is the per cent level of significance of a deviation t = ±3 or
x = ±3σ?
EXERCISES

1.  In  a  coin-tossing  experiment  in  which  a  coin  was  tossed  400  times, 
250  heads  appeared.  Do  you  believe  that  the  experiment  was  honestly 
performed? 

2. Suppose that the mortality statistics for a large group of cities show
the average death rate from tuberculosis to be 196.5 per 100,000 population,
and σ = 14. A particular city showed a death rate from tuberculosis of
110.3 per 100,000. Is this surprising? Another city (a haven for tuberculosis
patients) showed a death rate of 245 per 100,000 for the same
disease. Is this surprising from the point of view of mere chance?

3.  A  coin  was  tossed  100  times.  Find,  using  the  normal  curve,  the 
probability  of  obtaining  exactly  60  heads. 

4. In a college the 12 grades A+, A, A-; B+, B, B-; C+, C, C-;
D, E, and F are given. On the assumption that ability in mathematics
is normally distributed, how many in a group of 1,000 grades should receive
each grade mentioned? Assume that the total range is M ± 3.0σ.

5. (Thurstone) Construct three frequency curves on the same sheet
according to the following specifications. Indicate an ordinate at the mid-

    Curve     σ      M       N       w
      A      15     50      400     10
      B      15     50      800     10
      C      15     50    1,200     10

6. Construct three frequency curves on the same sheet according to
the following specifications. Compute ordinates for each half-sigma.

    Curve     σ      M       N       w
      A       5     50    1,000     10
      B      10     50    1,000     10
      C      15     50    1,000     10

7. Draw a normal curve $\phi(t)$ and divide the base line into five parts
such that when ordinates are erected at the points of division the five areas
will be equal.

8. A normal distribution has the following constants: N = 1,000;
w = 5; M = 73.64; σ = 8.3. How many variates are between X = 61
and X = 94?

9.  Determine  whether  it  is  expected  that  one  will  obtain: 

a.  2,048  heads  in  4,040  throws  of  a  coin. 

b.  3,300  heads  in  6,400  throws  of  a  coin. 

c.  38,024  appearances  of  a  four,  a  five,  or  a  six  in  78,000  throws  of  a 
single  die. 
10.  Compute  the  ordinates  for  the  point  binomial  (J  +  and  compare
them  with  the  ordinates  of  a  superimposed  normal  curve.

11. If a baseball player has a batting average of 0.300, what is the
probability that he will hit safely at least 25 times out of 100 times at bat?
Estimate by the normal curve. Note that $\alpha_3$ is small.

12.  If  16  coins  are  tossed,  what  is  the  probability  of  getting  5,  6,  7, 
8,  9,  10,  11,  or  12  heads?  (a)  Use  the  theorem  of  repeated  trials  for  a  result 
correct  to  two  decimals,  and  (b)  the  normal  curve  for  an  approximate 
result  to  two  decimals. 

13. The probability of a man of age 56 dying within a year is 0.02.
If an insurance company has 10,000 policies in force on men of this age,
find the probability of the company's having to pay less than 180 death
claims; more than 220 death claims. Estimate by the normal curve.
Note that $\alpha_3$ is small.

14.  A  large  number  of  students  were  measured  as  to  height  and  for 
them  we  found  M  =  67.5  inches.  We  found  that  40  per  cent  of  the 
students  were  between  66.2  inches  and  68.8  inches  in  height.  What  is  the 
standard  deviation  of  the  heights? 

15. In the United States in 1930, 12 per cent of the marriageable men
were widowers. Assume this situation normal. A city has 6,000 men who
are marriageable (single men 15 years old and over). (a) How many would
you expect to be widowers? Note that $\alpha_3$ is small. (b) Estimate the
probability that there will be as few as 600 widowers. (c) As many as
750 widowers.

16.  The  experience  of  a  manufacturing  concern  has  been  that  in  the 
past  they  have  had  to  discard  5  per  cent  of  the  units  inspected  as  de- 
fective. A  sample  of  1,000  units  is  up  for  inspection,  (a)  How  many 
defective  units  would  you  expect?  (b)  What  are  the  values  at  the  5  per 
cent  level  of  significance? 

17. In 1930, about 9 per cent of the people of the United States were
"20 and under 25" years of age. In a typical city of the United States
of population 10,000, how many would you expect to find between 20
and 25 years of age? Adopting ±3σ as the limits of reasonable chance
occurrence, would you be surprised to find as few as 800? As many
as 1000?

18.  (Waugh)  In  an  epidemic  of  infantile  paralysis  which  took  place 
in  the  eastern  part  of  the  United  States  in  the  fall  of  1931,  we  have  records 
on  927  children  who  contracted  the  disease.  Of  these,  408  received  no 
serum  and  104  of  the  408  became  paralyzed,  while  the  other  304  recovered 
without  paralysis.  If  the  serum  had  no  effect,  how  many  cases  would  you 
have expected among the 519 who were given serum? (Assume 3σ marks
the limit of reasonable chance occurrence.) Actually 166 of the children
receiving  serum  were  paralyzed.  What  do  you  conclude  as  to  the  efficacy 
of  the  serum?  What  other  factors  might  influence  the  result  besides  the 
effect  of  the  serum? 
19. A group of 1,000 students took an objective and standardized test.
The distribution was closely normal with M = 60 and σ = 10. What
are the values of $Q_1$, $Q_3$, $Q$, $\alpha_3$, $\alpha_4$, and the 87th percentile?

20.  It  has  been  established  that  of  children  under  one  year  of  age  who 
are  afflicted  with  whooping  cough  about  50.5  per  cent  recover.  A  hospital 
has  27  children  less  than  a  year  old  who  are  afflicted  with  this  disease. 
Establish  the  5  per  cent  level  of  significance  as  to  the  number  of  re- 
coveries and  state  carefully  what  you  have  found. 

21. In the registration area of the United States in 1931, 51 per cent
of the births were males. In a certain city in 1931, 100 babies were born.
(a) What is the probability of as few as 45 females? (b) As many as
60 females? (c) What is the probability of exactly 45 females? (d) What
is the probability of exactly 60 females?

106.   GRADUATION  OF  A   DISTRIBUTION  BY  THE 

NORMAL  CURVE 

In this book we have frequently called attention to the fact that
the distributions of observed data that we have analyzed are samples
of a larger population or universe. It has been pointed out that the
irregularities of the distributions may be due to a paucity of the
data or to fluctuations in sampling. The frequency curve is assumed
to represent generalized experience of data of a given type on the
assumptions (1) that N has been greatly increased and (2) that the
class intervals have been indefinitely diminished. By fitting a curve
to the observed data we have opportunity to compare observation
with idealization and to note the variations due to sampling.

If a mound-shaped frequency distribution is reasonably symmetrical,
the normal curve may approximately represent it. Of course if
a distribution is decidedly skew, a normal curve is not expected to
fit the data. Our problem in this section is to explain the steps in
determining the theoretical frequencies of a distribution, assuming
that they follow a normal curve. As was implied in the derivation
of the normal curve we assume that:

1. The mean and the standard deviation of the curve are equal to M
and $\sigma_{adj.}$ of the observed data.

2. The area under the curve equals the area of the histogram.

It follows from the first assumption that the first step in fitting a
normal curve to a distribution of observed data is to compute M
and $\sigma_{adj.}$.
A. Graduation by Ordinates. The following table of the graduation
of the distribution of the heights of colored soldiers (see p. 168)
will show the steps in the process.

Table 93. Graduation by the Normal Curve: Ordinates

      X      Observed    x = X - M    t = x/σ_adj.    φ(t)     Theoretical
               f(x)                                            f(x) = 1894.6346 φ(t)
     (1)       (2)          (3)           (4)         (5)         (6)

    148.5        2        -23.39        -3.44        .0011         2.1
    150.5        9        -21.39        -3.15        .0028         5.3
    152.5       13        -19.39        -2.85        .0069        13.1
    154.5       23        -17.39        -2.56        .0151        28.6
    156.5       56        -15.39        -2.26        .0310        58.7
    158.5       88        -13.39        -1.97        .0573       108.6
    160.5      162        -11.39        -1.68        .0973       184.4
    162.5      318         -9.39        -1.38        .1540       291.8
    164.5      468         -7.39        -1.09        .2203       417.4
    166.5      564         -5.39        -0.79        .2920       553.2
    168.5      665         -3.39        -0.50        .3521       667.1
    170.5      708         -1.39        -0.20        .3910       740.8
    172.5      749         +0.61        +0.09        .3973       752.7
    174.5      747          2.61         0.38        .3712       703.3
    176.5      586          4.61         0.68        .3166       599.8
    178.5      469          6.61         0.97        .2492       472.1
    180.5      314          8.61         1.27        .1781       337.4
    182.5      207         10.61         1.56        .1182       224.0
    184.5      133         12.61         1.85        .0721       136.6
    186.5       70         14.61         2.15        .0396        75.0
    188.5       38         16.61         2.44        .0203        38.5
    190.5       22         18.61         2.74        .0094        17.8
    192.5       15         20.61         3.03        .0041         7.8
    194.5       10         22.61         3.33        .0016         3.0
    196.5        3         24.61         3.62        .0006         1.1
    198.5        2         26.61         3.91        .0002         0.4

    Total    6,441                                              6,440.6

For the distribution in question
we have previously computed M = 171.89, $\sigma_{adj.}$ = (3.3996)2. Applying
equation (8), the theoretical frequencies are given by:

$$y = \frac{Nw}{\sigma_{adj.}}\,\phi(t) = 1894.6346\,\phi(t)$$

The values of t which correspond to the given values of X are most
easily found by multiplying x by $1/\sigma_{adj.}$, and in this case:

$$\frac{1}{\sigma_{adj.}} = 0.147076$$

The following steps are recommended as the proper procedure in
fitting a normal curve by ordinates.

1. Compute M, $\sigma_{adj.}$, and $1/\sigma_{adj.}$.

2. Using equation (8), write the equation of the theoretical frequencies.

3. Write down columns (1) and (2), giving class marks and frequencies,
of the table upon which the computations are to be carried out.

4. Compute values of x for column (3).

5. Compute values of t for column (4).

6. Write down values of $\phi(t)$ from the table in Appendix B.

7. Compute the theoretical frequencies from the equation found in
step 2.
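The seven steps for graduation by ordinates can be sketched in code. This is a sketch under stated assumptions: the constants N, w, M, and the adjusted standard deviation are those quoted for Table 93; φ(t) is computed from its formula rather than read from Appendix B; and t and φ(t) are rounded to two and four decimals to imitate a table lookup. Individual entries may still differ from the printed table by a unit in the last place.

```python
import math

def phi(t: float) -> float:
    """Ordinate of the standard normal curve at t."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

# Step 1: constants quoted in the text for the height distribution.
N, w = 6441, 2.0
M = 171.89
sigma_adj = 3.3996 * 2.0            # written (3.3996)2 in the text

# Step 2: multiplier in y = (N w / sigma_adj) * phi(t).
scale = N * w / sigma_adj

# Step 3: class marks 148.5, 150.5, ..., 198.5.
class_marks = [148.5 + 2.0 * i for i in range(26)]

rows = []
for X in class_marks:
    x = X - M                           # step 4: deviation from the mean
    t = round(x / sigma_adj, 2)         # step 5: deviation in t units
    ordinate = round(phi(t), 4)         # step 6: "tabled" value of phi(t)
    rows.append((X, x, t, ordinate, scale * ordinate))   # step 7

for X, x, t, ordinate, freq in rows:
    print(f"{X:6.1f} {x:8.2f} {t:6.2f} {ordinate:7.4f} {freq:8.1f}")

total = sum(freq for *_, freq in rows)
print(f"total theoretical frequency = {total:.1f}")
```

Placing the observed frequencies beside the last column then gives the comparison of observation with idealization that the section describes.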

B. Graduation by Areas. The graduation of a distribution by
areas depends upon a few notions that we have not yet sufficiently
clarified. Since [see page 406]

$$A_Y]_{X_1}^{X_2} = A_y]_{x_1}^{x_2} = Nw \cdot A_\phi]_{t_1}^{t_2}$$

and further, since

Nw units of area represent N variates,

then

$Nw \cdot A_\phi]_{t_1}^{t_2}$ units of area represent $N \cdot A_\phi]_{t_1}^{t_2}$ variates.

[Figure: normal curve over a histogram rectangle, with the X scale (M, X1, X2), the x scale (0, x1, x2), and the t scale (0, t1, t2) on the base line.]

We shall indicate the increment of area under $\phi(t)$ between $t_1$
and $t_2$ by ΔA. The theoretical frequencies will then be computed by
N · ΔA.

By this means we are able to find the theoretical frequencies of
the various classes (to which the incremental areas under the curve
correspond) and compare them to the observed frequencies (to
which the rectangular areas of the histogram correspond). That is,
we compare, for example, the areas $X_1ABX_2$ and $X_1CDX_2$, or the
frequencies which they represent.
Table 94. Graduation by the Normal Curve: Areas

    Class      Observed    l_x - M    t = (l_x - M)/σ_adj.   Area from     ΔA      Theoretical
    lower        f(x)                                        -∞ to t                 N · ΔA
    limit l_x
     (1)         (2)         (3)            (4)                (5)         (6)        (7)

    147.5          2       -24.39         -3.59              .0002       .0003        1.9
    149.5          9       -22.39         -3.29              .0005       .0008        5.2
    151.5         13       -20.39         -3.00              .0013       .0022       14.2
    153.5         23       -18.39         -2.70              .0035       .0045       29.0
    155.5         56       -16.39         -2.41              .0080       .0090       58.0
    157.5         88       -14.39         -2.12              .0170       .0174      112.1
    159.5        162       -12.39         -1.82              .0344       .0286      184.2
    161.5        318       -10.39         -1.53              .0630       .0463      298.2
    163.5        468        -8.39         -1.23              .1093       .0643      414.2
    165.5        564        -6.39         -0.94              .1736       .0842      542.3
    167.5        665        -4.39         -0.65              .2578       .1054      678.9
    169.5        708        -2.39         -0.35              .3632       .1129      727.2
    171.5        749        -0.39         -0.06              .4761       .1187      764.5
    173.5        747        +1.61         +0.24              .5948       .1071      689.8
    175.5        586         3.61          0.53              .7019       .0948      610.6
    177.5        469         5.61          0.83              .7967       .0719      463.1
    179.5        314         7.61          1.12              .8686       .0521      335.6
    181.5        207         9.61          1.41              .9207       .0357      229.9
    183.5        133        11.61          1.71              .9564       .0209      134.6
    185.5         70        13.61          2.00              .9773       .0120       77.3
    187.5         38        15.61          2.30              .9893       .0059       38.0
    189.5         22        17.61          2.59              .9952       .0028       18.0
    191.5         15        19.61          2.88              .9980       .0013        8.4
    193.5         10        21.61          3.18              .9993       .0004        2.6
    195.5          3        23.61          3.47              .9997       .0002        1.3
    197.5          2        25.61          3.77              .9999       .00008       0.5
    199.5          0        27.61          4.06              .99998      .00000       0.0

    Total      6,441                                                              6,439.6

We shall illustrate the procedure by graduating the distribution of
the heights of colored soldiers (see Table 94) for which we have found:

$$M = 171.89, \qquad \sigma_{adj.} = (3.3996)2, \qquad \text{and} \qquad \frac{1}{\sigma_{adj.}} = .147076.^{*}$$

*  The  question  that  naturally  presents  itself  to  the  thoughtful  student  at  this 
point  is:  What  is  the  criterion  to  determine  the  goodness  of  fit  of  a  theoretical 
curve  to  an  observed  distribution?  We  regret  that  the  answer  to  this  important 
question  takes  us  beyond  the  scope  of  this  text.  We  can  refer  the  reader  to  page 
78  of  Rietz  and  others,  Handbook  of  Mathematical  Statistics,  and  to  Karl  Pearson's 
Tables  for  Statisticians,  Pt.  I.  These  references  will  give  a  brief  discussion  of 
Pearson's  Chi-square  test.  For  fuller  information  we  refer  the  reader  to  Pearson's 
original  paper  in  Philosophical  Magazine,  Vol.  50,  ser.  5  (1900),  pp.  157-75. 
In the graduation of a distribution by the normal curve, using
areas, we shall find it convenient to follow these steps.

1. Compute M, $\sigma_{adj.}$, and $1/\sigma_{adj.}$.

2. Write down columns (1) and (2) of the table giving lower class-limits
and frequencies. Note that the classes are defined by their lower
limits, $l_x$.

3. Express the lower limits as deviations from M: $l_x - M$. This
gives column (3) of the table.

4. Express the deviations from M in units of $\sigma_{adj.}$:
$t = \dfrac{l_x - M}{\sigma_{adj.}}$. This gives column (4) of the table.

5. Using table of $\phi(t)$ in Appendix B, prepare column (5) of the table:
$A_\phi]_{-\infty}^{t}$. It will be noted that the desired areas are found by
subtracting the values in the table from 0.5000 for t < 0, and by adding the values
in the table to 0.5000 for t > 0.

6. By subtracting each area in column (5) from the area immediately
beneath it we compute ΔA. This gives column (6).

7. Compute the theoretical frequencies, N · ΔA.
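The steps for graduation by areas can be sketched in the same way. The assumptions are again the constants quoted for Table 94 and the use of the error function in place of Appendix B. Here t is rounded to two decimals and the cumulative area to four, to imitate a table lookup; this rounding matters, since the class frequencies near the mean are sensitive to the second decimal of t.

```python
import math

def cdf(t: float) -> float:
    """Area under phi(t) from minus infinity to t."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Step 1: constants quoted in the text for the height distribution.
N = 6441
M = 171.89
sigma_adj = 3.3996 * 2.0            # written (3.3996)2 in the text

# Step 2: lower class limits 147.5, 149.5, ..., 199.5.
lower_limits = [147.5 + 2.0 * i for i in range(27)]

# Steps 3 to 5: deviation from M, t units, cumulative area to the lower limit.
areas = []
for lx in lower_limits:
    t = round((lx - M) / sigma_adj, 2)
    areas.append(round(cdf(t), 4))

# Step 6: each area subtracted from the one immediately beneath it.
delta_A = [areas[i + 1] - areas[i] for i in range(len(areas) - 1)]

# Step 7: theoretical frequency of each class.
theoretical = [N * dA for dA in delta_A]

for lx, dA, freq in zip(lower_limits, delta_A, theoretical):
    print(f"{lx:6.1f} to {lx + 2.0:6.1f}   delta A = {dA:.4f}   N * delta A = {freq:7.1f}")

total = sum(theoretical)
print(f"total theoretical frequency = {total:.1f}")
```

Unlike graduation by ordinates, this method does not involve the class interval w explicitly; the interval enters only through the spacing of the lower limits.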

EXERCISES 

1.  Graduate  by  ordinates  and  by  areas  the  distribution  of  chest  measure- 
ments which  is  given  in  Exercise  10,  page  168. 

2.  Graduate  the  distribution  of  the  heights  of  college  men  given  in 
(a)  of  Exercise  1,  page  54.   Use  areas. 

3.  Graduate  by  areas  the  distribution  of  the  head  breadths  given  in 
Exercise  2,  page  54. 

4. Find the equation of the distribution of pulse beats which is found
in Table 29 (p. 165), assuming normality.

5.  Plot  the  normal  curve  and  the  frequency  polygon  for  the  theoretical 
and  the  observed  distributions  given  in  Table  93.  Do  the  same  for  the 
distributions  in  Table  94. 

MISCELLANEOUS  EXERCISES 

1. If $Y_X = \dfrac{1}{\sigma_X\sqrt{2\pi}}\; e^{-\frac{(X - M_X)^2}{2\sigma_X^2}}$, show that $Y_{aX} = \dfrac{1}{a}\,Y_X$.

2. If $Y_X$ has the value given in Exercise 1, find $Y_{aX+b}$ in terms of $Y_X$.

3. Three per cent of all children are left-handed. In a group of
1,000 children what is the probability that as few as 20 will be left-handed?
That as many as 40 will be left-handed? Establish the number of children
at the 5 per cent level of significance.
4.  Based  upon  the  Mendelian  hypothesis,  it  is  expected  that,  on
crossing  a  certain  type  of  pea,  25  per  cent  of  the  seeds  will  be  green.  An 
experiment  on  this  type  of  pea  gave  4,960  yellow  and  1,840  green  seeds. 
Is  the  divergence  within  the  5  per  cent  level  of  significance? 

5.  Show  that  the  frequency  curve 


is  symmetrical. 

6.  Plot  on  the  same  axes  frequency  curves  of  the  form  given  in  Exer- 
cise 5  when  (1)  a  =  5,  p  =  2;  (2)  a  =  5,  p  =  10;  (3)  a  =  5,  p  =  100. 
Assume  yo  =  100  in  each  case. 

7.  Show  that  as  a  and  p  increase  without  limit  but  in  such  a  way  that 
a/p  is  constant  and  equal  to  A'^,  the  curve  given  in  Exercise  6  approaches 
the  normal  form 

y  =  VoB  ' 

8. We replace the single constant a in Exercise 5 after factoring by
$a_1$ and $a_2$ thus obtaining


which is skew. Plot on the same axes this curve when (1) $a_1 = 4$, $a_2 = 5$,
p = 10; (2) $a_1 = 10$, $a_2 = 5$, p = 0.3. Assume $y_0 = 100$.

9.  Show  that  as  a2  increases  without  limit,  ai  remaining  constant, 
the  formula  in  Exercise  8  approaches  the  form 


10.  Draw  the  curve  in  Exercise  9  when  yo  —  25,  ai  =  12,  p  =  1.3. 


Chapter 13


THE  THEORY  OF   SAMPLING:  MEASURES 

OF  RELIABILITY 

107.  INTRODUCTION 

We  may  regard  the  numerical  description  of  any  mass  of  statistical 
data  from  two  points  of  view.  We  may  regard  the  description  as  an 
end  in  itself,  a  mere  summary  of  our  measurements,  or  we  may 
regard  it  as  a  sample  drawn  from  a  larger  group  which  we  call  the 
parent  population  or  the  universe. 

Usually,  the  larger  point  of  view  obtains,  that  of  forming  judgments 
of  the  universe  from  a  study  of  the  sample.  In  some  cases  it  is 
impossible  to  measure  the  entire  universe,  and  in  other  cases  it  is 
impracticable  to  do  so.  Even  if  such  a  goal  as  measuring  the  entire 
universe  was  possible  of  attainment,  the  added  expense  in  time  and 
labor  would  be  an  unnecessary  luxury.  For,  by  carefully  selecting  a 
sample,  excellent  estimates  of  the  statistical  parameters  of  the  uni- 
verse can  be  obtained. 

The  statistician  is,  therefore,  generally  forced  to  work  with  samples. 
We  compute  the  mean  of  the  sample  and  use  this  mean  as  a  basis  for 
estimating the mean of the universe. Similarly, we use the dispersion
of  the  sample  as  a  basis  for  estimating  the  dispersion  of  the  universe; 
and  so  on.  Naturally,  we  must  then  attempt  to  state  the  degree 
of  confidence  we  can  attach  to  our  estimates.  This  we  do  in  terms 
of  probability. 

It  is  obvious  that  in  order  to  make  a  good  estimate  of  the  universe, 
we  must  have  a  good  sample,  a  representative  sample.  Securing 
such  a  sample  is  not  always  an  easy  task,  but  generally  it  can  be 
done.  The  procedures  employed  in  securing  such  samples  are  be- 
yond the  scope  of  this  book.  In  what  follows,  when  we  use  the  term 
sample,  we  mean  a  statistical  sample  wherein  any  one  individual  in 
the  parent  population  is  just  as  likely  to  be  included  as  any  other. 
Such a sample is often called a random sample.

This process of generalizing statistical results, of making inferences
regarding the universe from the study of the sample, is called
statistical induction. Obviously, it is a problem of supreme importance.
Karl Pearson has called it "the fundamental problem of practical
statistics."

108.  THE  PROBLEM  OF  THIS  CHAPTER 

We  have  spent  no  little  time  in  the  preceding  chapters  with  ques- 
tions relating  to  the  numerical  description  of  a  mass  of  data  as  an 
end  in  itself.  We  have  seen  that  it  is  possible  to  describe  succinctly 
a  mass  of  numerical  data.  The  essence  of  the  data  may  be  condensed 
to  four  measures:  (1)  the  mean,  (2)  the  dispersion,  (3)  the  skewness, 
and  (4)  the  excess.  For  example,  given  the  measurements  of  the 
heights  of  1,000  men,  we  are  able  to  give  a  numerical  description  of 
the  1,000  measurements.  They  may  show  an  arithmetic  mean  of 
67.5 inches, a standard deviation of 2.5 inches, a coefficient of skewness,
$\alpha_3$, of 0.036, and an excess, $\alpha_4 - 3$, of 0.123. If our problem
is  limited  to  a  characterization  of  the  1,000  measurements,  our 
problem  is  fairly  completely  solved.  In  characterizing  a  mass  of  data 
by  means  of  a  few  statistical  constants,  we  are  able  to  comprehend 
the  significant  facts  of  the  mass  which  might  not  otherwise  be  possible. 

If  we  adopt  the  second  and  broader  point  of  view  and  consider 
the  1,000  measurements  as  a  representative  sample  and  are  concerned 
with  using  the  properties  of  the  sample  in  order  to  make  inferences 
about  the  parent  population  from  which  the  sample  is  chosen,  it  is 
clear  that  we  cannot  speak  with  meticulous  certainty  concerning 
the  computed  statistical  constants  and,  as  a  consequence,  our  lan- 
guage should  be  modified.  Another  sample  of  1,000  measurements 
of  the  heights  of  men  chosen  in  a  similar  manner  will  probably  yield 
at  least  slightly  different  statistical  constants.  In  other  words,  these 
so-called  statistical  constants  show  variation  as  we  move  from  sample  to 
sample. 

While  the  statistical  constants  computed  from  successive  samples 
show  variation,  it  must  not  be  inferred  that  the  variation  is  unlimited. 
As  a  matter  of  fact  the  statistical  constants  computed  from  moder- 
ately large  random  samples  selected  from  a  larger  group  show  an 
uncanny stability. It is due to this remarkable and measurable
stability of the statistical constants computed from sample to
sample  that  we  may  make  inferences  from  a  relatively  small  set  of 
observed  data.  A  measure  of  the  stability  of  a  statistical  constant  is 
often  called  a  measure  of  its  reliability. 

The  so-called  statistical  constant  derived  from  the  analysis  of  a 
sample  is  frequently  called  by  some  writers,  following  R.  A.  Fisher, 
a statistic, and the corresponding quantity belonging to the universe
a  parameter.  A  statistic  is  thus  an  estimate  of  a  parameter.  For  a 
given  universe  a  parameter  is  fixed  but  the  statistic  may  vary  from 
sample  to  sample. 

It will be the problem of this chapter to define a range of variation
about the statistical parameter of the universe within which fluctuations
of the statistics, due to pure chance, may be expected to occur
according to definite probabilities. It must be borne in mind that
the variations due to a multiplicity of factors other than pure chance
can in no way be accounted for by the sampling formulas that we
shall discuss. The variations we are considering "are the resultant
effect of a complex of forces which cannot be traced, still less measured,
and which have been happily described as that 'mass of floating causes
generally known as chance.'" If the variations are greater than
can be accounted for by chance, the significance of the variation
should, if possible, be accounted for and explained by the observer.

We  may  meet  problems  that  fall  into  two  broad  categories.  In 
the  first  category  the  parent  universe  may  be  known  and  we  may  wish 
to  establish  whether  or  not  a  statistic  of  a  sample  falls  within  a  pre- 
determined range  of  variation.  (In  this  case  the  parent  universe 
is  generally  finite.)  For  example,  a  manufacturer  of  some  article 
may  have  examined  a  large  number  of  a  given  type  of  product, 
and  thus  may  have  been  able  to  adopt  rather  rigid  specifications  for 
the product. A sample is selected for a test. Does the sample fall
within  the  tolerance  limits  demanded  by  the  universe? 

In  the  second  category  the  parent  universe  is  unknown  and  we 
wish  to  estimate  its  parameters  by  finding  the  statistics  of  the 
sample,  and  to  measure  the  reliability  (or  degree  of  confidence)  we 
may  place  in  the  estimates.  By  far,  most  problems  that  occur  in 
the applications of the theory of sampling belong in this category.
In  this  case  the  universe  is  generally  considered  as  infinite. 

In most cases that arise, whether the universe be known or unknown,
the question, stated in rather general terms, is: How well
does the sample describe the universe? More precisely: How much
shall we allow the values of the statistical constants obtained from
the sample to vary to describe the parent universe?

109.  THE  STANDARD  DEVIATION  IN  CLASS  FREQUENCIES 

        Table 95A                          Table 95B
    (Parent Population)            (Theoretical Frequencies)

     Class       F(x)               Class       f(x)
       1        3,000                 1          30
       2        6,000                 2          60
       3       13,000                 3         130
       4       18,000                 4         180
       5       20,000                 5         200
       6       19,000                 6         190
       7       12,000                 7         120
       8        7,000                 8          70
       9        2,000                 9          20
     Total    100,000               Total     1,000

Suppose the frequency distribution of some single characteristic
is given by Table 95A. The relative frequencies of the several classes
are 3/100, 6/100, 13/100, etc. We choose from this homogeneous
population a sample of 1,000. The "expected" distribution of the
sample, by Section 95, would be that given by Table 95B. We know
of course from experience that the theoretically "expected" frequencies
would differ from those that would result from experiment,
just as I know that if I toss a coin 100 times I "expect" 50 heads
and 50 tails whereas I may actually get 48 heads and 52 tails. And
from my experience with coin-tossing experiments I am not shocked
by this result.

Suppose  that  we  should  obtain  a  large  number  of  samples  of  1,000 
observations,  each  taken  under  the  same  essential  conditions.  A 
class  frequency,  say  that  of  Class  3,  will  vary  from  sample  to  sample. 
These  values  will  form  a  frequency  distribution.  The  variations, 
called  "variations  due  to  sampling"  or  "variations  due  to  sampling 
errors," can frequently be accounted for and explained. Such a
question as, "What is the variation that would occur in Class 3 if
we obtained a large number of samples of 1,000 observations from
the population in Table 95A?" we can answer approximately.

To answer this question we consider any observation as a trial,
and a success if an observation falls in the class. Thus the probability
of an observation falling in Class 3 is p = 13/100, and the
probability of an observation not falling in the class is q = 87/100.
And we have the standard deviation of the frequency of this class
to be theoretically $\sqrt{Npq} = \sqrt{1{,}000(0.13)(0.87)} = 10.6$. So that we should
expect $Np \pm 3\sqrt{Npq}$, or 130 ± 32 observations, as setting the limits
of the frequency of Class 3 of the sample of 1,000.

If the probable error rather than the standard deviation is taken
as the measure of the variation, then the probable error of the frequency
of Class 3 is $0.6745\sqrt{Npq}$, or 0.6745(10.6) = 7.1. Hence, if
many random samples of 1,000 observations were taken from the
population of Table 95A, we should expect theoretically the frequency
of Class 3 of the sample to fall within 130 ± 7 about half the time.

If plus and minus three times the standard deviation of the expected
frequency be taken as the variation in the frequency that
may be allowed due to sampling, then if many samples of 1,000
observations are actually taken from the population of Table 95A,
we might obtain frequency distributions with the variation in the
class frequencies as indicated in Table 96C. So that if we were
sampling from Table 95A and should secure a sample with the
frequencies given by Table 96D, we would be inclined to suspect
that randomness went awry since the frequencies in Classes 3 and 7
are outside the limits set by Table 96C.

        Table 96C                          Table 96D

     Class        f(x)                  Class       f(x)
       1       30 ± 3(5.4)                1          25
       2       60 ± 3(7.5)                2          75
       3      130 ± 3(10.6)               3         175
       4      180 ± 3(12.1)               4         200
       5      200 ± 3(12.6)               5         210
       6      190 ± 3(12.4)               6         170
       7      120 ± 3(10.3)               7          80
       8       70 ± 3(8.1)                8          50
       9       20 ± 3(4.4)                9          15
     Total      1,000                   Total     1,000
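The three-sigma limits of Table 96C, and the comparison with the sample of Table 96D, can be reproduced with a short Python sketch. The function name `class_limits` and its `width` parameter are illustrative choices, not part of the text; the frequencies are copied from Tables 95A and 96D.

```python
import math

# Table 95A: parent-population class frequencies F(x); their total is S.
parent = [3000, 6000, 13000, 18000, 20000, 19000, 12000, 7000, 2000]
# Table 96D: class frequencies of one observed sample; their total is N.
observed = [25, 75, 175, 200, 210, 170, 80, 50, 15]

S = sum(parent)
N = sum(observed)

def class_limits(F_k, S, N, width=3.0):
    """Return (expected frequency, standard deviation, lower limit, upper limit)."""
    p = F_k / S                      # probability of falling in the class
    q = 1.0 - p                      # probability of not falling in the class
    expected = N * p
    sd = math.sqrt(N * p * q)
    return expected, sd, expected - width * sd, expected + width * sd

outside = []
for k, (F_k, f_k) in enumerate(zip(parent, observed), start=1):
    expected, sd, lo, hi = class_limits(F_k, S, N)
    inside = lo <= f_k <= hi
    if not inside:
        outside.append(k)
    note = "" if inside else "  <-- outside the limits"
    print(f"class {k}: {expected:5.0f} +/- 3({sd:4.1f})   observed {f_k:3d}{note}")

print("classes outside the limits:", outside)
```

Replacing `width=3.0` by `0.6745` gives the probable-error limits instead of the three-sigma limits.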

In general, if the frequency of the kth class of the parent distribution
of population S be $F_k(x)$, then the probability of an observation's
falling in that class is $p_k = \dfrac{F_k(x)}{S}$, and the probability of the
observation's not falling in that class is $q_k = 1 - \dfrac{F_k(x)}{S}$. If a
sample of N is chosen the expected frequency of the kth class of the
sample distribution is $Np_k$ with the standard deviation $\sqrt{Np_kq_k}$.

In applications we do not know the parent population and hence
the true value of $p_k$ is unknown. Let $f_k(x)$ be the observed frequency
of the kth class of the sample. If N is fairly large we accept $f_k(x)/N$
as an approximation to $p_k$. Then we have

    $\sigma_{f_k(x)} = \sqrt{f_k(x)\left(1 - \dfrac{f_k(x)}{N}\right)}$

Hence the frequency¹ of the kth class may be written with its
probable error as

    $f_k(x) \pm 0.6745\sqrt{f_k(x)\left(1 - \dfrac{f_k(x)}{N}\right)}$

This means that if a sample of N is taken from some unknown
parent distribution, the chances are even that the observed frequency
of the kth class, $f_k(x)$, will not differ from the expected frequency
of the kth class by more than $\pm\, 0.6745\sqrt{f_k(x)\left(1 - \dfrac{f_k(x)}{N}\right)}$.

If each class frequency of a distribution of N variates is divided
by N, we obtain a distribution of relative frequencies or percentages.
As a corollary to the theorem for finding $\sigma_{f_k(x)}$ we can immediately
derive a formula for finding the variation in the relative frequency
of the kth class, $\sigma_{p_k(x)}$, where $p_k(x) = f_k(x)/N$.

From Exercise 21 on page 148 we have $\sigma_{ax} = a\sigma_x$. Employing
this theorem we have

    $\sigma_{p_k(x)} = \dfrac{1}{N}\,\sigma_{f_k(x)} = \sqrt{\dfrac{p_k(x)\,q_k(x)}{N}} = \sqrt{\dfrac{1}{N}\cdot\dfrac{f_k(x)}{N}\left(1 - \dfrac{f_k(x)}{N}\right)}$

where $q_k(x) = 1 - p_k(x)$.

This formula, when used in its broad meaning to measure the
variation in a percentage, is usually written

    $\sigma_p = \sqrt{\dfrac{pq}{N}}$

where q = 1 - p.

¹ See Rietz, H. L., Mathematical Statistics, pp. 119-122, for a formula which
gives a closer approximation.

Example.  Suppose  that  of  a  large  number  of  men  examined  for  military 
service  about  70  per  cent  have  been  accepted.  If  the  same  standards  are 
imposed  in  future  examinations,  what  are  the  limits  of  percentage  accept- 
ances expected  from  a  sample  of  1,000? 

Solution. We have p = 0.70, q = 0.30, N = 1,000.

    $\sigma_p = \sqrt{\dfrac{(0.70)(0.30)}{1{,}000}} = 0.014 = 1.4$ per cent

Adopting $\pm 3\sigma_p$ as the limits of the percentage accepted, we should
expect the percentage accepted to vary from 70 - 4.2 per cent to 70 + 4.2
per cent. That is, we should expect from 65.8 to 74.2 per cent of the men
examined to be accepted.
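The arithmetic of this example is easy to check in Python. The sketch below is illustrative only (the function name `sigma_of_proportion` is an assumption). It carries the unrounded standard deviation through, so its limits differ in the first decimal from the 65.8 and 74.2 obtained in the text by rounding to 1.4 per cent before tripling.

```python
import math

def sigma_of_proportion(p, n):
    """Standard deviation of a sample proportion: sqrt(p*q/n), with q = 1 - p."""
    return math.sqrt(p * (1.0 - p) / n)

p, n = 0.70, 1000
sigma_p = sigma_of_proportion(p, n)
low, high = p - 3 * sigma_p, p + 3 * sigma_p

print(f"sigma_p = {sigma_p:.4f}")
print(f"3-sigma limits: {100 * low:.1f} to {100 * high:.1f} per cent")
```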


110.   AN  EXPERIMENT  IN  SAMPLING 


In order to clarify the problem of the sampling process, let us consider
the parent universe of 64 variates distributed according to
the point binomial $64(\frac{1}{2} + \frac{1}{2})^6$. Table 97 exhibits the
parent distribution in tabular form. We indicate the
mean and the standard deviation of the universe by
$M_u$ and $\sigma_u$ respectively.

        Table 97

      X        f(x)
      0          1
      1          6
      2         15
      3         20
      4         15
      5          6
      6          1
    Total       64

For this universe we have:

    $M_u = np = 6(\frac{1}{2}) = 3$

    $\sigma_u = \sqrt{npq} = \sqrt{6(\frac{1}{2})(\frac{1}{2})} = 1.225$

    $\alpha_3 = \dfrac{q - p}{\sqrt{npq}} = 0$


In order that we may draw random samples from the given parent
population we prepare 64 cards in the following manner. On 1 card
we write X = 0 and $X^2$ = 0; on 6 cards we write X = 1 and $X^2$ = 1;
on 15 cards we write X = 2 and $X^2$ = 4; on 20 cards we write X = 3
and $X^2$ = 9; and so on for the entire parent distribution. We now
have a parent population of 64 members, one card for each individual,
from which we may draw random samples. Suppose we draw samples
of 10 cards. The remaining 54 cards constitute a sample of 54 cards.
With each drawing we therefore obtain samples of N = 10 and
N = 54. The sum of X for all 64 cards is 192 and the sum of $X^2$
for all 64 cards is 672. We shuffle the cards well and take a sample
of 10 cards. We total the values of X and of $X^2$ on the 10 cards and
find for the first sample of 10 that $\Sigma X$ = 26 and $\Sigma X^2$ = 100. For
the first sample of 10 we now find M = 2.6 and $\sigma$ = 1.8. We thus
have one sample mean and one sample standard deviation for
N = 10. For N = 54 we also have $\Sigma X$ = 192 - 26 = 166 and
$\Sigma X^2$ = 672 - 100 = 572, from which we compute M = 3.1 and
$\sigma$ = 0.99 (with the mean rounded to 3.1; carrying the unrounded
mean 3.074 gives $\sigma$ = 1.07). We thus have for N = 54 one sample mean and one
sample standard deviation. We place the cards again on the pack,
shuffle them well again, and draw 10 cards, from which we again
compute the sample means and the sample standard deviations.
We can continue this process and select as many samples as we
please. Obviously ${}_{64}C_{10}$ distinct samples can be secured. We show
below the distributions of 100 actual sample means for the case in
which N = 10 and the case in which N = 54. We denote by Z any
sample mean and its frequency by f(z). (See Table 98.)
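The bookkeeping for a single drawing can be sketched in Python as follows. The helper name `mean_and_sd` is illustrative. The totals 26 and 100 are those reported for the first sample of 10 cards, and the figures for the remaining 54 cards follow by subtraction from the totals of the full pack.

```python
import math

# Parent population of 64 cards: the value X appears f(X) times (Table 97).
deck = [x for x, f in zip(range(7), [1, 6, 15, 20, 15, 6, 1]) for _ in range(f)]
total_x = sum(deck)                    # sum of X over all 64 cards
total_x2 = sum(x * x for x in deck)    # sum of X^2 over all 64 cards

def mean_and_sd(sum_x, sum_x2, n):
    """Mean and standard deviation computed from the totals of X and X^2."""
    m = sum_x / n
    return m, math.sqrt(sum_x2 / n - m * m)

# First drawing reported in the text: ten cards with sum X = 26, sum X^2 = 100.
m10, s10 = mean_and_sd(26, 100, 10)
# The 54 cards left in the pack, by subtraction from the pack totals.
m54, s54 = mean_and_sd(total_x - 26, total_x2 - 100, 54)

print(f"pack totals: sum X = {total_x}, sum X^2 = {total_x2}")
print(f"N = 10: M = {m10:.3f}, sigma = {s10:.3f}")
print(f"N = 54: M = {m54:.3f}, sigma = {s54:.3f}")
```

Because the sketch keeps the unrounded mean for the 54 cards, its standard deviation for that sample comes out near 1.07; the figure 0.99 arises only when the mean is first rounded to 3.1.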

Distribution  (a),  which  has  100  sample  means,  was  derived  by 
drawing  samples  of  10  variates  from  the  previously  described  parent 
population  of  64  variates  and  computing  the  means  of  the  samples 
drawn.  Distribution  (b),  with  samples  of  54  variates,  was  similarly 
derived.  Each  distribution  is  therefore  a  distribution  of  sample 
means  that  has  its  mean  (the  mean  of  the  means),  its  standard  devia- 
tion (the  standard  deviation  of  the  means),  its  skewness  (the  skewness 
of  the  means),  and  so  on.  It  is  the  standard  deviation  of  the  means 
in  which  we  are  especially  interested,  for  it  gives  a  measure  of  the 
variability  of  the  distribution  of  means. 

We shall leave it as an exercise for the student to verify the following
values:

    N = 10:   $M_z = 2.997$,   $\sigma_z = 0.298$
    N = 54:   $M_z = 3.00$,    $\sigma_z = 0.078$

                    Table 98

        (a) N = 10                   (b) N = 54

       Z        f(z)                Z*        f(z)
      2.2         1                3.15         1
      2.3         1                3.13         1
      2.4         2                3.11         2
      2.5         3                3.09         3
      2.6         5                3.07         5
      2.7         6                3.06         6
      2.8         9                3.04         9
      2.9        14                3.02        14
      3.0        20                3.00        20
      3.1        13                2.98        13
      3.2         8                2.96         8
      3.3         7                2.94         7
      3.4         4                2.93         4
      3.5         3                2.91         3
      3.6         1                2.89         1
      3.7         2                2.87         2
      3.8         1                2.85         1
     Total      100               Total       100

* These are rounded values.
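The verification left to the student can be sketched in Python directly from the columns of Table 98. The helper `grouped_mean_sd` is an illustrative name. For N = 10 the sketch reproduces the mean 2.997 and the standard deviation 0.298. For N = 54 the rounded tabulated values give a mean of 3.00 but a standard deviation nearer 0.056 than the printed 0.078; the table alone does not settle the source of that difference.

```python
import math

# Frequencies f(z) shared by both halves of Table 98.
freq = [1, 1, 2, 3, 5, 6, 9, 14, 20, 13, 8, 7, 4, 3, 1, 2, 1]
# Column (a): sample means for N = 10, from 2.2 to 3.8 in steps of 0.1.
z_a = [round(2.2 + 0.1 * i, 1) for i in range(17)]
# Column (b): sample means for N = 54 (rounded values as tabulated).
z_b = [3.15, 3.13, 3.11, 3.09, 3.07, 3.06, 3.04, 3.02, 3.00,
       2.98, 2.96, 2.94, 2.93, 2.91, 2.89, 2.87, 2.85]

def grouped_mean_sd(values, freqs):
    """Mean and standard deviation of a frequency distribution."""
    n = sum(freqs)
    m = sum(v * f for v, f in zip(values, freqs)) / n
    var = sum(f * (v - m) ** 2 for v, f in zip(values, freqs)) / n
    return m, math.sqrt(var)

m_a, s_a = grouped_mean_sd(z_a, freq)
m_b, s_b = grouped_mean_sd(z_b, freq)
print(f"N = 10: M_z = {m_a:.3f}, sigma_z = {s_a:.3f}")
print(f"N = 54: M_z = {m_b:.3f}, sigma_z = {s_b:.3f}")
```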

Figure 59 represents the curve for the parent distribution and Figure 60
the ordinates of the distribution of sample means for N = 10.

[Figure 59. Curve of the Parent Distribution]

[Figure 60. Ordinates of the Distribution of Sample Means for N = 10]

It  will  be  observed  that  the  sample  means  are  approximately 
normally  distributed  above  and  below  Mz  =  2.997,  but  with  a  dis- 
persion much  less  than  that  of  the  parent  population.  In  the  next 
section  we  shall  derive  some  theorems  that  should  explain  these 
phenomena. 

The following exercises are given primarily to prepare the student
for a facile reading of the succeeding section. The various numbers
should therefore be solved in detail.


EXERCISES

1. Consider the parent population of 5 variates: $X_1, X_2, X_3, X_4, X_5$.
Write down the 10 distinct samples of 3 variates that may be drawn.
For example, $X_1, X_2, X_3$; $X_1, X_2, X_4$.

2. Let $Z_i$ represent the ith sample mean and write down the 10 distinct
sample means for the parent population in Exercise 1. For example,

    $Z_1 = \dfrac{X_1 + X_2 + X_3}{3}$

3. Show that for the sample means in Exercise 2:

    $\dfrac{\Sigma Z_i}{10} = \dfrac{\Sigma X_i}{5}$

State in words the theorem of this formula.

4. Show that:

    $(\Sigma X_i)^2 = \Sigma X_i^2 + 2\,\Sigma X_iX_j$

where

    $\Sigma X_i = X_1 + X_2 + X_3 + X_4 + X_5$

5. For the values of $Z_i$ found in Exercise 2 show that:

    $\dfrac{\Sigma Z_i^2}{10} = \dfrac{1}{15}\left[\Sigma X_i^2 + \Sigma X_iX_j\right]$

6. Using the relationship in Exercise 4 show that:

    $\dfrac{\Sigma Z_i^2}{10} = \dfrac{1}{30}\left[\Sigma X_i^2 + (\Sigma X_i)^2\right]$

7. From equation (7) of Chapter IV we have $\sigma_z = \sqrt{\dfrac{\Sigma Z_i^2}{10} - M_z^2}$.
Use this relationship and those established in Exercises 3 and 6 above to
show that, for the distribution of means here considered:

    $\sigma_z = \dfrac{\sigma_x}{\sqrt{6}}$


111.   THE  DISTRIBUTION  OF  MEANS 

Let us now consider the general problem of characterizing the distribution
of sample means derived by drawing samples of N variates
from a parent population of S variates. Obviously ${}_SC_N$ distinct
samples may be drawn. Each sample has its mean and the ${}_SC_N$
samples give us a distribution of sample means.

We  shall  undertake  to  characterize  this  distribution  of  means  as 
we  should  characterize  any  distribution,  that  is,  by  finding  its  mean, 
its  standard  deviation,  its  skewness,  and  so  on. 

A. The Mean of the Means. Let the parent universe be denoted
by $X_1, X_2, X_3, \ldots, X_S$. Denoting any sample mean by $Z_i$, we
have:

    $Z_1 = \dfrac{1}{N}\left[X_1 + X_2 + \cdots + X_{N-1} + X_N\right]$

    $Z_2 = \dfrac{1}{N}\left[X_1 + X_2 + \cdots + X_{N-1} + X_{N+1}\right]$

    $\cdots$

    $Z_{{}_SC_N} = \dfrac{1}{N}\left[X_{S-N+1} + \cdots + X_{S-1} + X_S\right]$        (1)

We desire to find the mean of this distribution of sample means.
We must find $\Sigma Z \div {}_SC_N$. Note that each parenthesis contains N
terms and that the ${}_SC_N$ lines contain $(N \cdot {}_SC_N)$ terms which are not
all distinct. One X occurs as frequently as another. Hence each of
the S X's occurs $\left(\dfrac{N}{S} \cdot {}_SC_N\right)$ times. That is:

    $M_z = \dfrac{\Sigma Z}{{}_SC_N} = \dfrac{1}{S}\,\Sigma X_i = M_x$        (2)

We may express this important result as follows:

Theorem: The mean of the ${}_SC_N$ sample means formed by selecting
samples of N variates from a parent population of S variates is equal to
the mean of the S variates.

Stated less accurately, we may say that the mean of the distribution
of means is equal to the mean of the parent universe: $M_M = M_u$.
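The theorem can be checked by brute force on a small population. In the Python sketch below the five population values are hypothetical (any numbers would serve). All distinct samples of 3 variates from 5 are enumerated, and the mean of the sample means is compared with the mean of the parent population.

```python
from itertools import combinations
from statistics import mean

# Hypothetical parent population of S = 5 variates.
population = [2.0, 3.0, 5.0, 8.0, 12.0]
N = 3

samples = list(combinations(population, N))   # every distinct sample of N variates
sample_means = [mean(s) for s in samples]     # one mean Z for each sample

print("number of distinct samples:", len(samples))
print("mean of the sample means:  ", mean(sample_means))
print("mean of the population:    ", mean(population))
```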

B. The Standard Deviation of the Means. We shall now proceed
to find the standard deviation of the ${}_SC_N$ sample means with which
we can measure the variability of the distribution of means. We
should recall that the standard deviation of the parent universe is
given by

    $\sigma_x = \sqrt{\dfrac{\Sigma X_i^2}{S} - M_x^2}$

and that the standard deviation of the distribution of sample means
is given by:

    $\sigma_z = \sqrt{\dfrac{\Sigma Z^2}{{}_SC_N} - M_z^2}$        (3)

Since $M_z$ is known, $\sigma_z$ can be determined if we can find $\Sigma Z^2$.
From equations (1) we have:

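    $Z_1^2 = \dfrac{1}{N^2}\left[X_1^2 + X_2^2 + \cdots + X_N^2\right] + \dfrac{2}{N^2}\left[X_1X_2 + X_1X_3 + \cdots + X_{N-1}X_N\right]$

    $Z_2^2 = \dfrac{1}{N^2}\left[X_1^2 + X_2^2 + \cdots + X_{N-1}^2 + X_{N+1}^2\right] + \dfrac{2}{N^2}\left[X_1X_2 + X_1X_3 + \cdots + X_{N-1}X_{N+1}\right]$

    $\cdots$

    $Z_{{}_SC_N}^2 = \dfrac{1}{N^2}\left[X_{S-N+1}^2 + \cdots + X_S^2\right] + \dfrac{2}{N^2}\left[X_{S-N+1}X_{S-N+2} + \cdots + X_{S-1}X_S\right]$

To find the sum of these ${}_SC_N$ squared means we note that the
sum of the parentheses containing terms of the type $X_i^2$ may be found
as follows: Each parenthesis contains N terms and the ${}_SC_N$ parentheses
contain $(N \cdot {}_SC_N)$ terms which are not all distinct. One $X_i^2$
occurs as frequently as another. Hence each of the given S $X_i^2$'s
occurs $(N \cdot {}_SC_N \div S)$ times.

To sum the parentheses containing the cross-product terms of the
type $X_iX_j$, note that each parenthesis contains ${}_NC_2$ terms and the
${}_SC_N$ parentheses contain $({}_NC_2 \cdot {}_SC_N)$ terms which are not all distinct.
One cross-product term occurs as frequently as another.
Since we can get ${}_SC_2$ cross-product terms from the given S letters,
each of the cross-product terms must occur $({}_NC_2 \cdot {}_SC_N \div {}_SC_2)$
times. Therefore:

    $\Sigma Z^2 = \dfrac{1}{N^2}\cdot\dfrac{N \cdot {}_SC_N}{S}\,\Sigma X_i^2 + \dfrac{2}{N^2}\cdot\dfrac{{}_NC_2 \cdot {}_SC_N}{{}_SC_2}\,\Sigma X_iX_j$

and

    $\dfrac{\Sigma Z^2}{{}_SC_N} = \dfrac{1}{NS}\,\Sigma X_i^2 + \dfrac{2(N-1)}{NS(S-1)}\,\Sigma X_iX_j$

Since

    $(\Sigma X_i)^2 = \Sigma X_i^2 + 2\,\Sigma X_iX_j$

we have:

    $2\,\Sigma X_iX_j = (\Sigma X_i)^2 - \Sigma X_i^2$

Hence:

    $\dfrac{\Sigma Z^2}{{}_SC_N} = \dfrac{S-N}{N(S-1)}\cdot\dfrac{\Sigma X_i^2}{S} + \dfrac{S(N-1)}{N(S-1)}\left(\dfrac{\Sigma X_i}{S}\right)^2$

Substituting this in (3), recalling from (2) that $M_z = M_x$, we have
upon simplifying:

    $\sigma_z = \sigma_x\sqrt{\dfrac{S-N}{N(S-1)}}$        (4)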

Since, in general, S is very large when compared with N, we can
obtain a simpler relationship if we assume that S is infinite. For
this case we obtain¹

    $\sigma_z = \dfrac{\sigma_x}{\sqrt{N}}$        (5)

in which, we repeat for emphasis, N is the number of variates in the
sample and $\sigma_x$ is the standard deviation of the parent universe.
In fact, in (4) and (5) we may replace $\sigma_z$ and $\sigma_x$ by $\sigma_M$ and $\sigma_u$. As
the constants describing the parent universe are usually not known,
formulas (4) and (5) are apparently of value only theoretically.
Since we have stated that it is our problem to make certain inferences
about the parent universe from a consideration of the sample we shall
see in a later section how (5) will assist us in doing it. Experiment
justifies our making the assumption that the standard deviation of the
parent universe is approximately equal to the standard deviation of the
sample, the goodness of the approximation increasing as N is increased.
This assumption makes possible our expressing the formula for the
standard deviation of the mean in a workable form. We have,
finally

    $\sigma_M = \dfrac{\sigma}{\sqrt{N}}$ (approximately)        (6)

where M is the mean of the sample, $\sigma$ is the standard deviation of the
sample, and N is the number of variates in the sample. That is:

    the standard deviation of the arithmetic mean
        = (the standard deviation of the sample) / $\sqrt{\text{the number in the sample}}$
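Formulas (4) and (5) can be checked numerically. The Python sketch below does so in two ways. First it enumerates every sample of 3 from a hypothetical population of 8 variates and compares the exact standard deviation of the means with formula (4). Second it simulates the card experiment of Section 110 with a seeded random generator. The function name `sd_of_means`, the 8 population values, the seed, and the number of simulated drawings are all illustrative assumptions.

```python
import math
import random
from itertools import combinations
from statistics import pstdev

def sd_of_means(sigma_x, N, S=None):
    """Formula (5) when S is None (infinite universe); otherwise formula (4)."""
    base = sigma_x / math.sqrt(N)
    if S is None:
        return base
    return base * math.sqrt((S - N) / (S - 1))

# 1. Exhaustive check on a hypothetical population of S = 8 variates, N = 3.
population = [1, 4, 4, 6, 7, 9, 12, 15]
S, N = len(population), 3
all_means = [sum(s) / N for s in combinations(population, N)]
exact = pstdev(all_means)
by_formula = sd_of_means(pstdev(population), N, S)
print(f"exhaustive: {exact:.6f}   formula (4): {by_formula:.6f}")

# 2. Card experiment: 64 cards, samples of 10, seeded simulation.
deck = [x for x, f in zip(range(7), [1, 6, 15, 20, 15, 6, 1]) for _ in range(f)]
rng = random.Random(0)
sim_means = [sum(rng.sample(deck, 10)) / 10 for _ in range(20000)]
print(f"simulated sd of means: {pstdev(sim_means):.3f}")
print(f"formula (4), S = 64:   {sd_of_means(pstdev(deck), 10, 64):.3f}")
print(f"formula (5), S large:  {sd_of_means(pstdev(deck), 10):.3f}")
```

With S = 64 and N = 10 the sample is a sizable fraction of the universe, so the finite-universe factor in formula (4) matters and formula (5) overstates the variability of the means.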

¹ This is easily seen if we divide numerator and denominator of the quantity
under the radical by S and note that N/S and 1/S approach zero as S becomes
infinite.

C. The Probable Error of the Mean. In Chapter 4 we insisted
that the standard deviation is an excellent measure of the variability
of a distribution. We also insist that the standard deviation of the
mean as computed from (6) is an excellent measure of the variability
of the distribution of means. Tradition, however, has been a potent
influence in commending the use of the probable error to measure
the variation in the several statistical constants. In the preceding
chapter, we defined the probable error of any measure by the equation:

    $E_x = 0.6745\,\sigma_x$

Therefore, the probable error of the mean is defined by the relation:

    $E_M = 0.6745\,\sigma_M = \dfrac{0.6745\,\sigma}{\sqrt{N}}$        (7)

The quantities $\sigma_M$ and $E_M$ are frequently used as measures of
the reliability of the arithmetic mean. Since the smaller the variation,
the greater the reliability, a small standard deviation of the mean
or a small probable error of the mean means "accurate shooting."
It is therefore evident from (6) and (7) that the smaller the $\sigma_M$ or $E_M$,
the greater the reliability in M.

The language of variation used in the preceding paragraph is
inverse. We can make the variation direct if we adopt the measure
h, as is done in the theory of errors, for the index of precision, where h
is defined by the equation (see Section 102, p. 399):

    $h_x = \dfrac{1}{\sigma_x\sqrt{2}}$

For the distribution of means we have

    $h_M = \dfrac{1}{\sigma_M\sqrt{2}} = \dfrac{\sqrt{N}}{\sigma\sqrt{2}}$        (8)

as the index of precision of the mean. It will be observed from (6),
(7), and (8) that the reliability of the mean or the precision of the
mean varies as the square root of the number in the sample. That
is, the greater the number in the sample, the greater the reliability
in the mean. For example, to double the reliability, we must quadruple
the frequency.

It is not customary, however, in elementary statistics, to use
$h_M$ as the measure of the reliability of the mean. Rather do the
workers in applied statistics prefer $\sigma_M$ or $E_M$. In fact, it is the custom
(see Section 37, p. 143) to write the probable error of the mean
immediately after the computed mean of the sample with a ± sign
between them, thus:

    $M_u = M \pm E_M$        (9)

For example, suppose a sample distribution of the heights of
1,000 men shows an arithmetic mean of 67.5 inches and a standard
deviation of 2.5 inches. Then:

    $E_M = 0.6745\,\dfrac{2.5}{\sqrt{1{,}000}} = 0.053$ inch

and

    $M_u = 67.5 \pm 0.053$ inches

Since the distribution of sample means collected from a normal parent
population is itself normal, this means simply that if a large number
of the means of samples of the heights of 1,000 men were collected,
half of the sample means would be within 0.053 inch of the mean
of the universe $M_u$ (= $M_M$). Since $M_M \pm 3\sigma_M$ or $M_M \pm 4.5E_M$
includes nearly all the sample means, it is practically certain that no
sample mean will differ from the mean of the universe $M_u$ by more
than ± 4.5(0.053) inches.

It should be emphasized that the expression $M_u = M \pm E_M$ is
not to be interpreted as stating that the true mean of the universe is
somewhere between $M - E_M$ and $M + E_M$; nor is it to be interpreted
as stating that the true mean probably differs from the computed
sample mean by the amount $E_M$. It means that, so far as
variation due to pure chance is concerned, the odds are even that a
sample mean M will not differ from the mean of the universe $M_u$ by
more than $E_M$.

If we were to write the arithmetic mean of the universe $M_u$ in the
form

$$M_u = M \pm \sigma_M$$

this would signify that the odds are about 2 to 1 that a sample mean
$M$ will not differ from $M_u$ by more than $\sigma_M$. It does not mean that
the odds are 2 to 1 that $M_u$ is within the interval whose end values
are $M - \sigma_M$ and $M + \sigma_M$. The probability pertains to the limits
of the range embracing $M_u$. We do not state the probability of $M_u$
lying within these limits, for $M_u$ is fixed. Thus, for the heights of
the sample of 1,000 men noted above we have

$$\sigma_M = \frac{2.5}{\sqrt{1{,}000}} = \frac{2.5}{31.623} = 0.08 \text{ inch}$$

And we say that the odds are 2 to 1 that the mean of the sample,
67.5 inches, is within 0.08 inch of the mean of the universe. The
odds are 95 to 5 (or 19 to 1) that the mean of the sample does not
differ from the mean of the universe by more than ± 1.96(.08) inches,
and the odds are 99 to 1 that the sample mean does not differ from
the mean of the universe by more than ± 2.58(.08) inches.
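
The multipliers attached to each statement of odds come from the normal curve, and they can be recovered rather than read from a table. The following Python sketch assumes the distribution of sample means is normal and uses the standard library's `NormalDist`; the function names are illustrative, not the book's.

```python
from math import sqrt
from statistics import NormalDist

def standard_error_of_mean(sigma, n):
    """sigma_M = sigma / sqrt(N), the large-sample standard error of the mean."""
    return sigma / sqrt(n)

def limit_for_odds(sigma_m, central_probability):
    """Half-width of the interval about M_u that holds the given share of sample means."""
    z = NormalDist().inv_cdf(0.5 + central_probability / 2)
    return z * sigma_m

sigma_m = standard_error_of_mean(2.5, 1000)   # heights of 1,000 men

# 0.50 gives the probable error, 0.6827 roughly one standard error,
# 0.95 and 0.99 the limits quoted as odds of 19 to 1 and 99 to 1.
for share in (0.50, 0.6827, 0.95, 0.99):
    print(share, round(limit_for_odds(sigma_m, share), 3))
```

The text rounds $\sigma_M$ to 0.08 before multiplying, so its limits differ slightly in the third decimal from those printed by the sketch.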

It has thus become customary to write $M_u$ in two different forms:
$M_u = M \pm E_M$ and $M_u = M \pm \sigma_M$. In the first case we have
"$M$ with a probable error of $E_M$," and in the second case we have
"$M$ with a standard error of $\sigma_M$." To avoid ambiguity the statistician
should state definitely what his symbols mean.


ILLUSTRATIVE  EXAMPLES 

Example  1.  A  corporation  which  sells  a  large  number  of  automobile 
tires  gathered  data  on  the  mileage  obtained  from  a  given  type  of  tire. 
A large group of 100,000 users were questioned, and the data analyzed.
For this universe of $S$ = 100,000 it was found that $M_u$ = 21,000 miles
and $\sigma_u$ = 2,000 miles.

At  a  later  time  in  order  to  compare  the  quality  of  the  product,  the 
corporation  secured  data  from  1,000  users  of  the  same  type  of  tire.  For 
this sample of $N$ = 1,000 it was found that $M$ = 20,960 and $\sigma$ = 1,980
miles. 

Was the corporation correct in concluding that the quality of the tire
was not impaired, or that the variation of $M$ from $M_u$ was not significant?
Solution.  Translating  formula  (4)  into  better  symbols,  we  have 


$$\sigma_M = \sigma_u\sqrt{\frac{S - N}{N(S - 1)}} = 2{,}000\sqrt{\frac{100{,}000 - 1{,}000}{1{,}000(100{,}000 - 1)}}$$

= 62.9 miles = 63 miles (rounded).
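
As a check on the arithmetic, the finite-universe form $\sigma_u\sqrt{(S-N)/(N(S-1))}$ and the simpler form $\sigma_u/\sqrt{N}$ can be computed side by side. This is a minimal Python sketch; the function names are illustrative assumptions.

```python
from math import sqrt

def standard_error_finite(sigma_u, n, s):
    """Standard error of the mean for a sample of n from a finite universe of s."""
    return sigma_u * sqrt((s - n) / (n * (s - 1)))

def standard_error_infinite(sigma_u, n):
    """Standard error of the mean when the universe is treated as unlimited."""
    return sigma_u / sqrt(n)

# Tire mileage data: sigma_u = 2,000 miles, N = 1,000, S = 100,000
finite = standard_error_finite(2000, 1000, 100_000)
infinite = standard_error_infinite(2000, 1000)

print(round(finite, 1), round(infinite, 1))
```

Both values round to 63 miles, because the sample is only one per cent of the universe; the finite-universe correction matters more as $N$ approaches $S$.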

Thus for the distribution of means, which is normal, we have $M_M = M_u$
= 21,000 miles and $\sigma_M$ = 63 miles.

If many such samples were taken we could expect 68.27 per cent, or about
two-thirds, of the means to fall within the interval $M_M \pm \sigma_M$. That is,
we should expect about two-thirds of the sample means to fall in the
interval 21,000 ± 63 miles, or between 20,937 and 21,063. Since 20,960 is
within this interval, we conclude that the quality of the tire is not impaired
and that the difference is not statistically significant.

[Figure: Curve of Means, drawn with a t scale and an M scale, with marked areas 0.2643 and 0.7357.]


We can look at the problem from another point of view. We express
the divergence $M - M_M$ in standard units. We find

$$t = \frac{M - M_M}{\sigma_M} = \frac{20{,}960 - 21{,}000}{63} = -\frac{40}{63} = -.63$$

Looking up the probability table we find

$$P(t < -.63) = .5000 - (\text{area from } 0 \text{ to } .63) = .5000 - .2357 = .2643$$

We  would  therefore  expect  26  per  cent  of  the  sample  means  to  be  less 
than  20,960  miles  and  74  per  cent  to  be  greater  than  20,960  miles.  In 
other  words  the  probability  of  a  sample  mean  being  less  in  value  than 
20,960  is  26/100  or  13/50  and  greater  than  20,960  is  74/100  or  37/50. 
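
The table lookup can also be reproduced directly from the normal curve. The sketch below is a Python illustration that takes the curve of means from Example 1 as normal with mean 21,000 and standard error 63 miles; it does not round $t$ before finding the area, so its probabilities agree with the tabled ones only to about two decimals.

```python
from statistics import NormalDist

curve_of_means = NormalDist(mu=21_000, sigma=63)   # M_M and sigma_M from Example 1
sample_mean = 20_960

# Divergence of the sample mean from M_M in standard units
t = (sample_mean - curve_of_means.mean) / curve_of_means.stdev

p_below = curve_of_means.cdf(sample_mean)   # share of sample means below 20,960
p_above = 1 - p_below                       # share of sample means above 20,960

print(round(t, 2), round(p_below, 2), round(p_above, 2))
```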

Of  course  we  can  base  our  argument  on  the  probable  error  of  the  mean 
instead  of  the  standard  error  of  the  mean.  We  find 

$$E_M = .6745\,\sigma_M = .6745(63) = 42 \text{ miles}$$

Then we can state that the chances are even that a sample mean will
lie in the interval 21,000 ± 42 or between 20,958 and 21,042. The given
mean 20,960 is within this interval, and such a small divergence as 40/42
probable errors from $M_M$ is certainly within the tolerance limits of the
most scrupulous.

Example 2. Suppose in the previous example we use formula (5)

$$\sigma_M = \frac{\sigma_u}{\sqrt{N}}$$

Will our results be affected?

Solution. This means, writing formula (4) in the form

$$\sigma_M = \frac{\sigma_u}{\sqrt{N}}\sqrt{\frac{1 - \frac{N}{S}}{1 - \frac{1}{S}}}$$

that $S$ is so large compared to $N$ that we may consider $N/S$ and $1/S$ as negligible.

We find for the data at hand

$$\sigma_M = \frac{2{,}000}{\sqrt{1{,}000}} = \frac{2{,}000}{31.62} = 63.2 = 63 \text{ miles (rounded)}$$

This  approximation  certainly  will  in  no  way  alter  our  previous  con- 
clusion. 

Example 3. If in Example 1 we use formula (6) for computing $\sigma_M$,
will our conclusion be altered?
Solution:

$$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{1{,}980}{\sqrt{1{,}000}} = \frac{1{,}980}{31.62} = 62.6 = 63 \text{ miles (rounded)}$$

And  this  approximation  will  also  in  no  way  alter  our  conclusion. 

Example  4.  The  blood  pressure  of  10,000  young  men  of  given  age  was 
measured and recorded. The analysis of the sample gave $M$ = 122 and
$\sigma$ = 9. Find the standard error and the probable error of the mean, and
interpret  them.   What  is  the  5  per  cent  level  of  significance? 

Solution.  In  this  case  we  do  not  know  the  statistics  of  the  universe. 
Our  information  about  the  statistics  of  the  universe  must  be  inferred 
from  the  statistics  of  the  sample.  The  mean  of  the  sample  is  an  estimate 
of  the  mean  of  the  universe.  How  reliable  is  the  estimate? 

We  compute  the  dispersion  of  the  sample  means  by  (6).  We  have 

$$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{9}{\sqrt{10{,}000}} = 0.09$$

which indicates the dispersion of the sample means about the universe
mean $M_u$. The universe mean is unknown. However, we can state that
the odds are 2 to 1 that the sample mean 122 does not differ numerically
from $M_u$ by more than 0.09. Since 99.74 per cent of the sample means
vary from $M_u$ by not more than ± 3$\sigma_M$ (= 0.27), we may conclude that
the odds are 99.74 to 0.26, or about 385 to 1, that the sample mean 122
does not differ numerically from $M_u$ by more than 0.27.

$$E_M = 0.6745\,\sigma_M = .6745(0.09) = 0.06$$

which indicates that the chances are even that the sample mean 122 does
not differ numerically from $M_u$ by more than 0.06.

The 5 per cent level of significance is at ± 1.96$\sigma_M$ or at ± 1.96(.09)
= ± 0.176. That is, the odds are 95 to 5, or 19 to 1, that the sample
mean 122 does not differ numerically from $M_u$ by more than 0.176. In
other words, if many samples of 10,000 were measured and their means
computed, we should expect 95 per cent of the means to lie within the
interval from $M_u - 0.176$ to $M_u + 0.176$.


EXERCISES 

1. Professor Sorenson* records an experiment in which fifty samples
of fifty men each were taken from American Men of Science and the mean
age of each sample computed. The means ranged in value from 41.40
years to 51.14 years. He found that this distribution of means was normal
with $M_M$ = 46.34 years and $\sigma_M$ = 2.30 years.

a.  What  is  the  estimated  mean  age  of  the  2,500  men?  Dr.  Sorenson 
gives  46.34  years  as  the  computed  mean  age  of  the  2,500  men. 

b. How many of the means would you expect to find between $M_M - \sigma_M$
and $M_M + \sigma_M$? Dr. Sorenson found 34, or 68 per cent of them.

c. Compute $E_M$. What does it mean?

d. How many of the means would you expect to find between $M_M - E_M$
and $M_M + E_M$? Dr. Sorenson found 25, or 50 per cent of them.

e.  Consider  the  2,500  ages  as  a  sample  of  all  the  men,  22,000,  whose 
names  appeared  in  the  book.  For  this  large  sample  Dr.  Sorenson  gives 
$M$ = 46.34 and $\sigma$ = 12.46. Compute $E_M$ and interpret it with regard
to  the  average  age  of  all  men  in  the  book. 

2. A sample of $N$ = 625 gave $E_M$ = 0.27. What size sample would be
required to give $E_M$ = 0.09? 0.045?

3.  A  distribution  of  the  weights  at  birth  of  a  sample  of  402  infants 
gave $M$ = 7.29 pounds and $\sigma$ = 1.006 pounds. Compute $E_M$ and
interpret it.

4. A study of "red blood cell count" for 40 normal men gave $M$ = 4.973
millions per cu. mm. and $\sigma$ = 0.332 millions per cu. mm. Find $\sigma_M$ and
interpret it.

5.  For  a  group  of  1,000  college  students  the  mean  height  was  68.2 
inches  and  the  standard  deviation  was  2.5  inches,  (a)  Find  the  probability 
that  in  a  sample  of  100,  the  mean  height  will  be  between  67.82  and  68.78 
inches,  (b)  Find  the  probability  that  in  a  sample  of  100,  the  mean  will 
be  greater  than  68.9  inches. 

6.  Consider  the  table  of  the  weights  of  men  found  on  page  140.  Does 
the  difference  between  each  sample  mean  and  the  universe  mean  lie  within 
the  5  per  cent  level  of  significance? 

7. Consider the table of the heights of men found on page 141. Does
the difference between each sample mean and the universe mean lie
within the 5 per cent level of significance?

* Herbert Sorenson: Statistics for Students of Psychology and Education.

8.  Using  the  probable  error  notation,  E.  S.  Pearson  gave  the  mean 
length  of  cubit  for  1,063  British  males  as  18.31  ±  0.019  inches.  Show 
that $\sigma$ = 0.92 inch. Does this mean that, assuming a normal distribution,
about  709  of  these  men  had  cubit  lengths  between  17.39  and  19.23  inches? 

9.  If  the  statement  is  made  that  the  mean  height  of  1,000  men  is 
68.78 ± 0.046 inches, can you adduce evidence that 0.046 is $E_M$ and not $\sigma_M$?

10.  (Freeman)  Two  engineers  made  1,306  readings  during  a  5-year 
period on the heat value in Btu. of a mixed gas. The distribution,
approximately normal, gave $M_u$ = 534.99 Btu. and $\sigma_u$ = 3.85 Btu. On
64  days  at  irregular  intervals,  state  inspection  was  conducted  and  the 
mean  of  the  approximately  normal  sample  was  536.72  Btu.  Would  you 
say  that  the  64  measures  constituted  a  random  sample? 

11.  The  breaking  strength  of  a  certain  type  of  cord  has  been  established 
from  considerable  experience  to  be  18.3  ounces  with  a  standard  deviation 
of  1.2  ounces.  A  sample  of  100  pieces  of  the  same  type  of  cord  shows  a 
mean  breaking  strength  of  16.5  ounces.  Would  you  say  that  the  sample 
is  inferior? 

12.  After  observing  a  large  number  of  cases  it  has  been  established  that 
a  certain  disease  is  10  per  cent  fatal.  The  hospital  of  the  Good  Shepherd 
found  that  during  the  period  1937-1942,  of  100  patients  admitted  with  the 
disease 12 died. May this difference be attributed to chance? Hint:

$$\sigma_p = \sqrt{pq/N}$$

13.  At  Bucknell  University  the  freshmen  who  take  College  Algebra 
are  previously  screened  by  a  placement  test.  Our  records  covering  a 
period  of  years  reveal  that  about  16  per  cent  fail  the  course.  During  the 
fall  semester,  1942,  of  400  freshmen  enrolled  in  College  Algebra  20  per  cent 
failed.  Adopting  the  5  per  cent  level  as  a  basis  for  judgment,  would 
you  say  this  difference  is  significant? 


D.  The  Skewness  and  Excess  of  the  Distribution  of  Means.  We 

have  derived  formulas  for  the  arithmetic  mean  and  the  standard 
deviation  of  the  distribution  of  sample  means  given  by  (1).  In  order 
to  characterize  more  completely  the  distribution,  we  should  derive 
formulas  for  the  skewness  and  the  excess.  In  this  section  we  shall 
give  an  abridged  derivation  for  the  skewness,  leaving  the  details 
for  the  reader  to  work  out,  and  shall  give  without  proof  the  formulas 
for  the  excess.  (See  Exercise  9  at  end  of  this  chapter.) 
The  skewness  for  the  distribution  of  means  is  given  by: 


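$$\alpha_{3,z} = \frac{\nu_{3,z}}{\sigma_z^3}$$

where

$$\nu_{3,z} = \frac{\Sigma z^3}{{}_SC_N} - 3\,\frac{\Sigma z^2}{{}_SC_N}\,M_z + 2M_z^3 \qquad (10)$$

Returning to equations (1) we have:

$$\frac{\Sigma z^3}{{}_SC_N} = \frac{1}{N^2 S}\left[\Sigma X^3 + \frac{3(N-1)}{S-1}\,\Sigma X_i^2 X_j + \frac{6(N-1)(N-2)}{(S-1)(S-2)}\,\Sigma X_i X_j X_k\right] \qquad (11)$$

Since

$$(\Sigma X)^2 = \Sigma X^2 + 2\,\Sigma X_i X_j$$

and

$$\Sigma X\,\Sigma X^2 = \Sigma X^3 + \Sigma X_i^2 X_j$$

and

$$(\Sigma X)^3 = \Sigma X^3 + 3\,\Sigma X_i^2 X_j + 6\,\Sigma X_i X_j X_k$$

we have:

$$\Sigma X_i^2 X_j = \Sigma X^2\,\Sigma X - \Sigma X^3$$

$$6\,\Sigma X_i X_j X_k = (\Sigma X)^3 - 3\,\Sigma X^2\,\Sigma X + 2\,\Sigma X^3$$

Substituting these values in (11) we obtain:

$$\frac{\Sigma z^3}{{}_SC_N} = \frac{1}{N^2}\left[\frac{(S-N)(S-2N)}{(S-1)(S-2)}\,\frac{\Sigma X^3}{S} + \frac{3S(S-N)(N-1)}{(S-1)(S-2)}\,\frac{\Sigma X^2}{S}\,M_x + \frac{S^2(N-1)(N-2)}{(S-1)(S-2)}\,M_x^3\right]$$

Substituting this and the other necessary values, previously found,
into (10) we have:

$$\nu_{3,z} = \frac{(S-N)(S-2N)}{N^2(S-1)(S-2)}\left[\frac{\Sigma X^3}{S} - 3\,\frac{\Sigma X^2}{S}\,M_x + 2M_x^3\right]$$

$$\nu_{3,z} = \frac{(S-N)(S-2N)}{N^2(S-1)(S-2)}\,\nu_{3,x}$$

and hence

$$\alpha_{3,z} = \frac{S-2N}{S-2}\sqrt{\frac{S-1}{N(S-N)}}\;\alpha_{3,x}$$

If $S$ is infinite:

$$\alpha_{3,z} = \frac{1}{\sqrt{N}}\,\alpha_{3,x} \qquad (13)$$

Further, if the parent population is normal,¹ $\alpha_{3,x} = 0$; hence
$\alpha_{3,z} = 0$. Therefore the skewness of the distribution of sample means
chosen from a parent normal distribution is zero.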

1 See p. 405.

By a similar procedure, but with much more laborious algebra, it
may be shown that for the distribution of sample means given by
(1) the excess is given by the formula:

iV(S  -  iV)(S  -  2)(S  -  3)    l-^*'-^  N(S  -  iV)(S  -  2)(S  -  3) 

If $S$ is infinite:

$$\alpha_{4,z} - 3 = \frac{1}{N}\,[\alpha_{4,x} - 3]$$

If the parent population is normal, $\alpha_{4,x} = 3$, in which case
$\alpha_{4,z} = 3$. Therefore, we may say that the excess of the distribution
of sample means chosen from a normal parent population is zero.

In the text we have stated that the distribution of means is a
normal distribution. It has long been known, probably since the
time of Gauss, that if random samples are taken from a universe
distributed normally, the means of the samples also form a normal
distribution. If the universe is non-normal, not a great deal is
known at present, from analytic considerations, about the distributions
of statistics of samples. However, even for small values of $N$,
there is sufficient experimental evidence to support the conclusion
that the distribution of means of samples selected randomly from any
finite universe is practically normal.

We have shown that if $S$ is unlimited and $N$ is large, $\alpha_{2,M} = 1$,
$\alpha_{3,M} = 0$, $\alpha_{4,M} = 3$. By a continuation of this same method,¹ under
the stated hypotheses, it is easy to show that $\alpha_{5,M} = 0$,
$\alpha_{6,M} = 1 \cdot 3 \cdot 5 = 15$, $\alpha_{7,M} = 0$,
$\alpha_{8,M} = 1 \cdot 3 \cdot 5 \cdot 7 = 105$, and so on.

That  is  to  say,  if  fairly  large  samples  are  taken  from  an  infinite 
universe,  the  moments  of  the  distribution  of  means  are  those  of  a 
normal  curve.  Further,  it  is  not  difficult  to  show  that  if  the  parent 
universe  is  infinite  and  distributed  according  to  the  Pearson  Type  III 
curve,  the  moments  of  the  distribution  of  sample  means  are  also  of 
the  Pearson  Type  III  curve.  However,  it  is  well  known  that  as  N 
increases  the  Type  III  curve  approaches  the  normal,  so  again  we  have 
the property that as $N$ grows large, the curve of means approaches
normality. 

1 Richardson, C. H., The Statistics of Sampling, published by Edwards Brothers,
Ann Arbor, Michigan.

We may say then that, so far as the practical needs are concerned,
the distribution of means has been rather thoroughly explored. We
regret that we cannot say so much for the distribution of sample
standard deviations. If the universe is normal the curve of the
standard deviations of samples is Type III. If the universe is
non-normal, we do not know the distribution function of $\sigma$, not even the
values of the moments. However, by working through the moments of
the variance ($= \sigma^2$) we arrive at the facts contained in the next section.


112.   THE  RELIABILITY  OF  THE   STANDARD  DEVIATION 

In  Section  110  (p.  425)  we  outlined  an  experiment  that  was  intended 
to  explain  to  the  reader  what  is  meant  by  a  sample  mean  and  a 
sample  standard  deviation.  Each  sample  drawn  has  its  mean,  its 
standard  deviation,  et  cetera.  In  order  to  introduce  the  reader  to  the 
problem  of  sampling,  we  have  shown  in  considerable  detail  in  the 
preceding  section  how  we  may  characterize  the  distribution  of  sample 
means.  We  were  especially  interested,  however,  in  finding  measures 
of the reliability of the mean, which measures we found in $\sigma_M$ and
$E_M$ ($= .6745\,\sigma_M$).

The sample standard deviations in like manner form a distribution
that may be characterized by its mean, $M_\sigma$ (the mean of the standard
deviations), its standard deviation, $\sigma_\sigma$ (the standard deviation of the
standard deviations), and so on. We are especially interested in
$\sigma_\sigma$ or $E_\sigma$, by which we measure the variability and the reliability of
any sample standard deviation.

The algebraic development showing the derivation of $M_\sigma$ and $\sigma_\sigma$
would take us too far afield. It can be shown that if the parent
population is normal and $N$ is large, the mean of the distribution of
standard deviations is approximately equal to the standard deviation
of the parent population, and the standard deviation of the distribution
of standard deviations is approximately equal to the standard deviation of
the parent population divided by the square root of twice the number
of variates in the sample.* That is:

$$M_\sigma = \sigma_u$$

$$\sigma_\sigma = \frac{\sigma_u}{\sqrt{2N}}$$

* See formula (24) of Section 114.

Since the standard deviation of the parent population, $\sigma_u$, is
approximately equal to the standard deviation of the sample, $\sigma$, we
have:

$$M_\sigma = \sigma \quad \text{approximately}$$

and

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2N}} \quad \text{approximately} \qquad (14)$$

$$E_\sigma = 0.6745\,\sigma_\sigma = 0.6745\,\frac{\sigma}{\sqrt{2N}} \qquad (15)$$

The meaning of $E_\sigma$ is similar to that given for $E_M$. Thus, it is
customary to write the standard deviation in the form

$$\sigma_u = \sigma \pm E_\sigma$$

which means that, assuming that the curve of sample standard
deviations is approximately normal, half the sample standard
deviations lie within the range whose end values are $\sigma_u - E_\sigma$ and
$\sigma_u + E_\sigma$. It also means that the chances are even that the sample
$\sigma$ does not differ from $\sigma_u$ by more than ± $E_\sigma$, and it is practically
certain that the sample $\sigma$ does not differ from $\sigma_u$ by more than
± 4.5($E_\sigma$).
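
Formulas (14) and (15) are simple enough to compute directly. The Python sketch below applies them to the heights sample used earlier in the chapter ($\sigma$ = 2.5 inches, $N$ = 1,000); the function names are illustrative, and the large-sample normal approximation is assumed throughout.

```python
from math import sqrt

def standard_error_of_sigma(sigma, n):
    """Formula (14): sigma_sigma = sigma / sqrt(2N), for large N and a normal parent."""
    return sigma / sqrt(2 * n)

def probable_error_of_sigma(sigma, n):
    """Formula (15): E_sigma = 0.6745 * sigma / sqrt(2N)."""
    return 0.6745 * standard_error_of_sigma(sigma, n)

sigma, n = 2.5, 1000   # sample standard deviation of heights, sample size

se = standard_error_of_sigma(sigma, n)
pe = probable_error_of_sigma(sigma, n)

print("standard error of sigma:", round(se, 3))
print("probable error of sigma:", round(pe, 3))
# "Practically certain" limits, sigma +/- 4.5 probable errors
print("limits:", round(sigma - 4.5 * pe, 2), "to", round(sigma + 4.5 * pe, 2))
```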

113.   THE  RELIABILITY  OF  THE  DIFFERENCE 
BETWEEN  TWO  MEASURES 

An  important  problem  in  applied  statistics  is  the  determination 
of  some  criterion  that  will  assist  one  in  judging  whether  an  observed 
difference  between  two  samples  is  apparent  or  real.  That  is,  is  the 
difference  between  two  samples  such  that  it  might  arise  from  sampling 
(that  is,  from  pure  chance),  or  is  the  difference  significant  of  a  greater 
variation  in  the  two  samples  than  can  be  explained  by  random 
sampling  alone? 

Suppose  we  select  from  a  normal  parent  population  two  samples, 
each  fairly  large.  Each  sample  has  its  mean,  its  standard  deviation, 
et  cetera.  The  two  means  will  not  likely  be  equal  and  hence  we  shall 
have  a  difference  of  two  means.  Also,  the  standard  deviations  will 
not likely be equal and hence we shall have a difference of two standard
deviations. Continue this process until we have, say, m pairs of
samples, m usually a large number, and hence m differences in means
that will constitute a distribution of differences in sample means.
From these m pairs of samples we may also have m differences in
standard deviations that will constitute a distribution of differences
of sample standard deviations.

Let $X_i$ and $Y_i$ be used to distinguish corresponding characteristics
(means, standard deviations, et cetera) of two groups when the
ith pair of samples has been taken, and $X_i - Y_i$ be the difference
in any pair of corresponding characteristics.


Table 99

Sample Pair    Group I (X)    Group II (Y)    Difference (X - Y)
1              X_1            Y_1             X_1 - Y_1
2              X_2            Y_2             X_2 - Y_2
...            ...            ...             ...
i              X_i            Y_i             X_i - Y_i
...            ...            ...             ...
m              X_m            Y_m             X_m - Y_m

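We shall find the arithmetic mean and the standard deviation for
the distribution of differences:

$$M_{X-Y} = \frac{\Sigma(X - Y)}{m} = M_X - M_Y$$

$$\sigma_{X-Y}^2 = \frac{\Sigma(X - Y)^2}{m} - \left[\frac{\Sigma(X - Y)}{m}\right]^2 \qquad \text{(by (7), Chapter 4)}$$

$$= \left[\frac{\Sigma X^2}{m} - \left(\frac{\Sigma X}{m}\right)^2\right] + \left[\frac{\Sigma Y^2}{m} - \left(\frac{\Sigma Y}{m}\right)^2\right] - 2\left[\frac{\Sigma XY}{m} - \frac{\Sigma X}{m}\,\frac{\Sigma Y}{m}\right]$$

Using (7) on p. 128, and (7) on p. 245, we have:

$$\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2 - 2\,r_{XY}\,\sigma_X\sigma_Y \qquad (17)$$

If the two distributions, X and Y, are independent so that $r_{XY}$
is zero, then:

$$\sigma_{X-Y} = \sqrt{\sigma_X^2 + \sigma_Y^2} \qquad (18)$$

We are especially interested in (18) when X and Y are the corresponding
means of two samples. Then

$$\sigma_{M_x - M_y} = \sqrt{\sigma_{M_x}^2 + \sigma_{M_y}^2} \qquad (19)$$

gives a measure of the variability (or the reliability) of the differences
in two sample means. Also

$$\sigma_{\sigma_x - \sigma_y} = \sqrt{\sigma_{\sigma_x}^2 + \sigma_{\sigma_y}^2} \qquad (20)$$

gives a measure of the variability and the reliability of the differences
in two sample standard deviations. The formulas for the corresponding
probable errors are found by multiplying (19) and (20) by
0.6745. Thus:

$$E_{M_x - M_y} = 0.6745\sqrt{\sigma_{M_x}^2 + \sigma_{M_y}^2} \qquad (21)$$

$$E_{\sigma_x - \sigma_y} = 0.6745\sqrt{\sigma_{\sigma_x}^2 + \sigma_{\sigma_y}^2} \qquad (22)$$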
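
The relation in (17), that the variance of the differences equals the sum of the two variances less twice the correlation times the product of the standard deviations, is an algebraic identity and can be confirmed numerically on any paired data. The Python sketch below uses a small set of hypothetical paired values (not from the text) and standard deviations computed with divisor m, as in the derivation; all names are illustrative.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Standard deviation with divisor m (the number of values)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    """Product-moment correlation r between paired values."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return covariance / (std_dev(xs) * std_dev(ys))

# Hypothetical paired characteristics of m = 6 sample pairs
xs = [32.1, 30.4, 33.8, 31.0, 34.2, 29.9]
ys = [30.9, 29.5, 31.2, 30.8, 32.0, 30.1]

differences = [x - y for x, y in zip(xs, ys)]

direct = std_dev(differences) ** 2
by_formula = (std_dev(xs) ** 2 + std_dev(ys) ** 2
              - 2 * correlation(xs, ys) * std_dev(xs) * std_dev(ys))

print(round(direct, 6), round(by_formula, 6))
```

When the correlation is zero the last term drops out, which is the special case stated in (18).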

Let  us  consider,  for  illustration,  the  results  on  the  placement 
examination  in  mathematics  of  two  different  freshman  classes  at 
Bucknell  University. 

Group I            Group II

N_x = 329          N_y = 302

M_x = 32.75        M_y = 30.60

σ_x = 8.05         σ_y = 6.95

The  difference  between  the  two  means  is  32.75  —  30.60  =  2.15. 
Is  this  difference  so  large  that  it  could  not  be  due  to  chance  or  does 
it  indicate  that  Group  I  really  demonstrated  a  significantly  better 
training  in  elementary  mathematics? 

There  is  no  question  about  the  observed  difference  in  the  two 
means.  It  is  certainly  2.15.  Could  such  a  difference  be  due  to 
chance?  Yes,  such  a  difference  could  be  due  to  chance  but  we  shall 
show  that  the  likelihood  that  it  did  arise  from  chance  is  so  small 
that we feel justified in neglecting it and in assuming that the difference
has been caused by other factors than pure chance. When
such is the case, the statistician says "the difference is significant."

Using (6), (19), and (21):

$$\sigma_{M_x} = 0.444 \qquad \sigma_{M_x - M_y} = 0.597$$

$$\sigma_{M_y} = 0.400 \qquad E_{M_x - M_y} = 0.403$$

We may now state that the chances are even that an observed
$(M_x - M_y)$ is within ± 0.403 of the true (unknown) value, and it is
practically certain that an observed $(M_x - M_y)$ is within ± 3(0.597)
or ± 4.5(0.403) of the true value. Following custom, we describe
this variation by writing 2.15 ± 0.403 which, translated into English,
reads "2.15 with a probable error 0.403." It may be noticed incidentally
that the difference 2.15 is 3.6 times its standard error and
5.3 times its probable error. Such a large numerical difference as
this would rarely occur by pure chance, in fact, about 3 times in
10,000. When the happening of an event, such as this under discussion,
is extremely unlikely, we conclude that some factors other
than pure chance have influenced the result.
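
The figures for the two freshman classes can be reproduced in a few lines. This Python sketch assumes, as the text does, that the two samples are independent and large enough for the normal curve to apply; the function and variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def standard_error_of_difference(sigma_x, n_x, sigma_y, n_y):
    """Formula (19) with each standard error of a mean taken as sigma / sqrt(N)."""
    return sqrt(sigma_x ** 2 / n_x + sigma_y ** 2 / n_y)

# Placement examination results for the two groups
m_x, sigma_x, n_x = 32.75, 8.05, 329
m_y, sigma_y, n_y = 30.60, 6.95, 302

difference = m_x - m_y
se_diff = standard_error_of_difference(sigma_x, n_x, sigma_y, n_y)
pe_diff = 0.6745 * se_diff                  # formula (21)

t = difference / se_diff                    # difference in standard errors
k = difference / pe_diff                    # difference in probable errors

# Chance of a difference at least this large, in either direction, by pure chance
p_two_sided = 2 * (1 - NormalDist().cdf(abs(t)))

print(round(se_diff, 3), round(pe_diff, 3))
print(round(t, 1), round(k, 1), round(p_two_sided, 5))
```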

While proofs for all the statements are beyond the scope of this
text, other pertinent facts are the following. If many independent
sample pairs are taken from normal parent populations, the differences
(indicated by D) of means, standard deviations, etc. also form
approximately normal distributions. As may be expected, the mean
of the distribution of differences, $M_D$, is zero and the standard deviation,
$\sigma_D$, is given by (18). The probable error of the D distribution
is of course $E_D = 0.6745\,\sigma_D$. It is customary to take

$$t = \frac{D}{\sigma_D} \quad \text{or} \quad k = \frac{D}{E_D}$$

as the criteria whereby one can quickly determine if the difference
D is significant. As a "rule of thumb" we say:

if t > 3, (or if k > 4.5), the difference is certainly significant;

if t > 2, (or if k > 3), the difference is possibly significant;

if t < 2, (or if k < 3), the difference is probably not significant.

These  limits,  however,  are  arbitrary,  and  consequently  vary  among 
the  authorities. 
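
The rule of thumb amounts to a three-way classification of the ratio of a difference to its standard error. A minimal Python sketch follows; the function names are illustrative, the numerical size of the ratio is used regardless of sign, and the treatment of a ratio exactly equal to 2 (placed in the lowest class here) is an assumption, since the rule as stated leaves that boundary open.

```python
def judge_difference(t):
    """Classify t = D / sigma_D by the rule of thumb, using its numerical size."""
    size = abs(t)
    if size > 3:
        return "certainly significant"
    if size > 2:
        return "possibly significant"
    return "probably not significant"

def t_from_k(k):
    """Convert k = D / E_D into t = D / sigma_D, since E_D = 0.6745 * sigma_D."""
    return 0.6745 * k

for ratio in (3.6, 2.6, 1.5):
    print(ratio, "->", judge_difference(ratio))
```

Because 4.5 probable errors are very nearly 3 standard errors, and 3 probable errors very nearly 2 standard errors, the limits stated in terms of k and those stated in terms of t describe practically the same classes.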

In the particular problem of this section:

$$t = \frac{D}{\sigma_D} = \frac{M_x - M_y}{\sigma_{M_x - M_y}} = \frac{2.15}{0.597} = 3.6$$

Hence,  from  a  comparison  of  the  means,  we  would  conclude  that 
Group  I  and  Group  II  came  from  statistically  different  parent  popula- 
tions; or,  if  from  the  same  parent  population,  then  other  factors  than 
pure  chance  must  have  caused  a  numerical  difference  as  large  as  2.15. 

The  assumptions  underlying  this  procedure  deserve  a  brief  con- 
sideration. The  universe  difference  in  the  means,  or  other  statistics, 
is  assumed  to  be  zero.  Is  this  a  reasonable  assumption?  We  think  it 
is. Let the reader return to Table 99, and remember that each
sample, fairly large, is drawn from a normal parent universe. It
would seem then that of the m differences $(X_i - Y_i)$, negative
differences would occur about as frequently as positive differences
and of equal numerical amounts so that their sum $\Sigma(X_i - Y_i)$ would
theoretically equal zero. Hence, theoretically $M_x = M_y$.

R. A. Fisher terms such an hypothesis a "null hypothesis," the
hypothesis  that  there  is  no  difference.  So  in  our  applications  we 
try  to  give  the  facts  a  chance  to  nullify  the  hypothesis.  We  make 
no  effort  to  prove  it  or  to  disprove  it;  rather  do  we  attempt  to  cast 
doubt  upon  it. 

In  our  illustrative  example  we  sought  evidence  that  the  two  samples 
came  from  different  universes.  Very  well,  on  the  basis  of  large 
sample  theory,  we  began  by  assuming  they  came  from  the  same 
universe with $M_D = 0$ and $\sigma_D = .597$. It is expected that practically
all of the actual differences will fall within 0 ± 3$\sigma_D$. If, therefore,
the actual difference D exceeds 3$\sigma_D$ numerically, then it is reasonable
to  conclude  that  our  assumption  of  the  same  universe  is  probably 
wrong.  Thus  we  conclude  that  the  two  samples  came  from  different 
universes. 

Of course we may wish to see what light a comparison of the
variabilities of the samples will throw upon our problem. We find,
using (14) and (20),

$$\sigma_{\sigma_x} = \frac{8.05}{\sqrt{2(329)}} = 0.314 \qquad \sigma_{\sigma_y} = \frac{6.95}{\sqrt{2(302)}} = 0.283$$

$$\sigma_{\sigma_x - \sigma_y} = \sqrt{(0.314)^2 + (0.283)^2} = 0.423$$

$$t = \frac{D}{\sigma_D} = \frac{8.05 - 6.95}{0.423} = 2.6$$


So  a  comparison  of  the  standard  deviations  supports  the  previous 
conclusion  since  t  >  2. 

As  this  problem  well  illustrates,  in  investigating  the  significance 
in  differences  it  is  a  wise  procedure  to  penetrate  the  problem  as 
deeply  as  possible. 

Example. Two samples of weights of male students gave the following
information: $N_1$ = 100, $M_1$ = 140.4 lbs., $\sigma_1$ = 17.7 lbs.; $N_2$ = 100, $M_2$
= 136.8 lbs., $\sigma_2$ = 16.2 lbs. If other samples are taken, what is the
probability that an observed difference in the means will be numerically
equal to or greater than D = 140.4 - 136.8 = 3.6 lbs.?

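Solution.

$$\sigma_{M_1} = \frac{17.7}{\sqrt{100}} = 1.77 \qquad \sigma_{M_2} = \frac{16.2}{\sqrt{100}} = 1.62$$

$$\sigma_D = \sqrt{(1.77)^2 + (1.62)^2} = 2.4$$

$$t = \frac{D}{\sigma_D} = \frac{3.6}{2.4} = 1.5$$

$$P(|t| \geq 1.5) = 2(0.5000 - 0.4332) = 0.1336$$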


That  is,  we  would  obtain  a  difference  numerically  as  large  as  3.6  about 
134  times  in  1,000. 


Figure 61

Curve of Differences, D,
between Sample Means

EXERCISES

1. The following table gives the distribution of the weights of 1,000 female
students subdivided into ten random samples, each of 100 individuals.
Measurements were recorded to nearest 1/10th pound.

Class     Frequencies in the ten samples (1st to 10th)        Total
 74.95    1, 1                                                    2
 84.95    1, 4, 1, 1, 4, 3, 2                                    16
 94.95    [illegible]
104.95    22, 18, 23, 19, 29, 23, 26, 30, 20, 21                231
114.95    30, 24, 23, 31, 28, 17, 21, 31, 22, 21                248
124.95    25, 21, 19, 22, 18, 21, 15, 15, 16, 24                196
134.95    9, 16, 15, 9, 12, 12, 10, 7, 19, 13                   122
144.95    2, 7, 5, 6, 6, 8, 10, 5, 6, 8                          63
154.95    2, 3, 1, 2, 2, 4, 2, 4, 3                              23
164.95    1, 0, 0, 0, 1, 1, 1, 1, 0                               5
174.95    2, 0, 3, 1, 0, 1                                        7
184.95    0, 0, 1                                                 1
194.95    0, 2                                                    2
204.95    1, 0                                                    1
214.95    1                                                       1
Total     100 in each sample                                   1000
M
σ

[Rows with fewer than ten entries had blank cells in the original table; the surviving entries are listed in order, and their sample columns are not recoverable.]

Let  the  universe  be  the  "total"  group. 

a. Compute $M$ and $\sigma$ for each sample and for the total.

b. Compare the mean of the ten sample means with $M_u$.

c. Compare the mean of the ten sample $\sigma$'s with $\sigma_u$.

d. Using $\sigma_M = \sigma_u/\sqrt{100}$, how many of the ten sample means are within
the five per cent level of significance?

e. Using $\sigma_\sigma = \sigma_u/\sqrt{200}$, how many of the ten sample $\sigma$'s are within the
five per cent level of significance?

f. Do you believe that randomness went awry on any sample?


2. (Tippett, p. 70) "The lengths of 4,000 hairs of an Indian cotton gave
$M$ = 2.33 cm. and $\sigma$ = 0.4806 cm. The first thousand hairs were selected
by a different method from the rest and gave a mean of 2.54 cm. Is this
deviation compatible with the hypothesis that the 1,000 are a random
sample from the 4,000 and that the difference in means is due to random
errors, or is the difference large enough to indicate that the change in
technique has had an effect?"

3.  A  contractor  purchased  a  certain  type  of  copper  sheeting  from  a 
manufacturer.  The  contract  specified  that  the  sheets  were  to  meet  a 
theoretical  standard  —  universe  mean  —  of  thickness  0.022  inch.  The 
contractor measured a sample of 100 sheets and found $M$ = 0.020 inch
and $\sigma$ = 0.003 inch. Did the contractor have reason to complain?

4. For ten years we at Bucknell University have given to the in-coming
freshmen a standardized test in pre-college mathematics. Based upon
this experience with S = 4,000 we have established the norms for the test:
$M_u$ = 62, $\sigma_u$ = 18. The freshman class of 400 admitted in September
1939, Class of 1943, took the test with the results: M = 58 and σ = 16.
Would you agree that the Class of 1943 was significantly ill-prepared in
mathematics? The Class of 1945 with N = 400 took the test with the
results: M = 60.5 and σ = 16.5. Is the Class of 1945 within the five per
cent level?

5. During a given month one machine produced 900 units but spoiled
3.2  per  cent  of  them.  During  the  same  month  another  machine  with  a 
more  experienced  operator  produced  1,000  units  but  spoiled  2.8  per  cent 
of  them.   Is  the  percentage  difference  in  spoilage  significant? 

6. A. S. Parkes and J. C. Drummond (Proc. Roy. Soc., B, XCVIII,
p. 147) gave the following data showing the effect of vitamin B on the
sex-ratio of offspring in rats. May the percentage change in males be
attributed to chance, or is the evidence sufficient to warrant that the
change was due to the increased vitamin B?

Diet                     Males   Females   Total Young   Per cent Males
Vitamin B Deficient       123      153         276           44.57
Vitamin B Sufficient      145      150         295           49.15
Totals                    268      303         571

114.   SMALL  SAMPLES 

The formulas for estimating the reliability of a statistic that we
have given previously are suitable when N is reasonably large, say
30 or more, but require modification when N is small. When N is
small, the σ of a sample which is used as an estimate of $\sigma_u$ gives
values too small and thus our standard errors have a downward bias.
To overcome this bias we need to develop a theory that will give
us a better estimate of the standard deviation of the universe $\sigma_u$
than is given by σ. We shall now attack the problem of finding the
standard deviation which gives the better estimate for $\sigma_u$.

Consider the parent universe $X_1, X_2, \ldots, X_S$. Transform the
S variates to the mean of the universe $M_u$ as origin and denote them
by $x_1, x_2, \ldots, x_S$ where $x_i = X_i - M_u$. We then have for the
universe

$$\sum_{i=1}^{S} x_i = 0.$$

From this universe we choose samples of N. In all we may choose
${}_S C_N$ samples. Each sample has its second moment and thus in all
we have ${}_S C_N$ second moments. These second moments give us
a distribution of sample second moments. It is our immediate
problem to find the mean of these ${}_S C_N$ sample second moments.

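Let

$m_{2,k}$ = the second moment of the kth sample about the mean
of the sample.

Then, for the kth sample, we have

$$m_{2,k} = \frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2 = \frac{\sum x^2}{N} - \frac{(\sum x)^2}{N^2}.$$

Since $(\sum x)^2 = \sum x^2 + 2\sum x_i x_j$, we have

$$m_{2,k} = \frac{N-1}{N^2}\sum x^2 - \frac{2}{N^2}\sum x_i x_j,$$

where $i \neq j$ (each pair being counted once), and the Σ's cover only the sample.
The mean of the distribution of second moments is given by

$$M_{m_2} = \frac{1}{{}_S C_N}\left[\frac{N-1}{N^2}\,{}_{S-1}C_{N-1}\sum x^2 - \frac{2}{N^2}\,{}_{S-2}C_{N-2}\sum x_i x_j\right],$$

where the Σ's cover the entire universe.
Again returning to

$$\left(\sum x\right)^2 = \sum x^2 + 2\sum x_i x_j$$

we note that for the universe $\sum x = 0$, and hence

$$-2\sum x_i x_j = \sum x^2 = S\sigma_u^2.$$

Then, substituting and simplifying,

$$M_{m_2} = \frac{S(N-1)}{N(S-1)}\,\sigma_u^2. \qquad (23)$$

If S becomes infinite,

$$M_{m_2} = \frac{N-1}{N}\,\sigma_u^2. \qquad (24)$$

That is, if the parent population is very large, the expected $\mu_2$
or $\sigma^2$ is $(N-1)/N$ times the parent $\mu_{2,u}$ or $\sigma_u^2$.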
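The result for a finite universe can be checked by brute force. The following is a minimal sketch, assuming a small made-up universe of six values (the data and the function names are illustrative, not from the text): it enumerates every possible sample of N, averages the sample second moments, and compares that average with S(N - 1)/[N(S - 1)] times the universe variance.

```python
from itertools import combinations
from statistics import fmean


def mean_sample_second_moment(universe, n):
    """Average, over all C(S, n) samples, of the second moment about the sample mean."""
    moments = []
    for sample in combinations(universe, n):
        m = fmean(sample)
        moments.append(fmean([(x - m) ** 2 for x in sample]))
    return fmean(moments)


def finite_universe_formula(universe, n):
    """S(N - 1) / (N(S - 1)) times the universe variance (divisor S)."""
    s = len(universe)
    mu = fmean(universe)
    var_u = fmean([(x - mu) ** 2 for x in universe])
    return s * (n - 1) / (n * (s - 1)) * var_u


# Hypothetical universe, chosen only for illustration.
universe = [2, 3, 5, 7, 11, 13]
n = 3
print(mean_sample_second_moment(universe, n))
print(finite_universe_formula(universe, n))
```

The two printed numbers should agree up to floating-point rounding, since the relation is exact for sampling without replacement.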

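If we replace in (24) the mean sample second moment by $\sigma^2_{sample}$ or $\sigma^2$, and $\sigma_u$ by $\sigma_{u,\,estimated}$ or
$\sigma_{u,est.}$, where $\sigma_{u,est.}$ is the best estimate of the standard deviation of
the universe from the sample (what R. A. Fisher calls the maximum
likelihood estimation of $\sigma_u$ from a sample) we have

$$\sigma^2 = \frac{N-1}{N}\,\sigma^2_{u,est.}, \qquad \sigma_{u,est.} = \sigma\sqrt{\frac{N}{N-1}}. \qquad (25)$$

If, as is customary, we find σ for a sample of N items by the
formula

$$\sigma = \sqrt{\frac{\sum x^2}{N}} \qquad (26)$$

we obtain

$$\sigma_{u,est.} = \sqrt{\frac{\sum x^2}{N-1}}. \qquad (27)$$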

Consequently, if we must estimate $\sigma_u$ from a sample, formula (27)
gives a better estimate than the customary one (26). Of course
if N is large, it is a matter of little consequence whether we divide
by N or by (N - 1), but when N is small, say less than 30, the use
of (N - 1) is particularly important.
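How much the choice of divisor matters can be seen by tabulating the factor √(N/(N - 1)) for several sample sizes. The sketch below is illustrative only: the sample values are invented, and it assumes that Python's `statistics.pstdev` (divisor N) and `statistics.stdev` (divisor N - 1) correspond to formulas (26) and (27) respectively.

```python
import math
import statistics


def bessel_factor(n):
    """sqrt(N / (N - 1)): ratio of the (N - 1)-divisor estimate to the N-divisor sigma."""
    return math.sqrt(n / (n - 1))


# Hypothetical sample, for illustration only.
sample = [4.0, 7.0, 9.0, 10.0, 15.0]
sigma = statistics.pstdev(sample)        # divisor N, as in formula (26)
sigma_u_est = statistics.stdev(sample)   # divisor N - 1, as in formula (27)
print(sigma, sigma_u_est, sigma * bessel_factor(len(sample)))

# The correction shrinks toward 1 as N grows.
for n in (5, 10, 30, 100):
    print(n, round(bessel_factor(n), 4))
```

For N = 5 the factor is about 1.118, while for N = 100 it is about 1.005, which is why the distinction is usually ignored for large samples.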

We immediately find, for N small,¹

$$\sigma_M = \frac{\sigma_{u,est.}}{\sqrt{N}} = \frac{\sigma}{\sqrt{N-1}}$$

where σ is computed from (26). Corresponding formulas for $E_M$
and $\sigma_\sigma$ are immediately found.

¹ The introduction of the factor $\sqrt{N/(N-1)}$ in (25) is called "Bessel's
correction," and the formula for the standard error of the mean $\sigma/\sqrt{N-1}$ is called
"Bessel's formula" [Friedrich Wilhelm Bessel (1784-1846)].

We  may  thus  compute  standard  and  probable  errors  for  the  various 
statistics,  mean,  standard  deviation,  differences,  and  so  on  when  N 
is  small  by  using  formulas  (27),  (28),  (29),  and  their  substitutions 
in (21) and (22). A word of caution is in order with regard to
applying them to establish probability levels.

In our previous discussion we have used the values of the normal
curve to assist in interpreting the values of

$$\frac{M - M_u}{\sigma_M}, \qquad \frac{\sigma - \sigma_u}{\sigma_\sigma}, \qquad \frac{M_2 - M_1}{\sigma_D}$$

because the distributions of these quantities are closely normal when
N is large. When N is small, these distributions deviate from
normality, the amount of the deviation increasing as N decreases.
A special table has been devised by R. A. Fisher which gives values
of t for various "degrees of freedom" n (n = N - 1 in the above
formulas) and various probabilities P that an observed value may
differ from zero by more than ±t. Or it gives values of t for given
levels of significance and given values of n.

This table differs considerably from that of the normal curve.
For example, with N = 11 or n = 10, the 1 per cent level of
significance in the normal curve is at t = ±2.58, whereas in the Fisher
table the value of t is t = ±3.17. When N is larger than 20, the differences
are not so appreciable, and when N is greater than 30 the normal
table may be used with slight error. This Fisher table is found in
the texts by Fisher and by Croxton and Cowden listed in the
Appendix. A general idea of the table may be obtained from the
portion that we reproduce on page 454.

In the use of this table remember that a "level of significance"
refers to both tails of the distribution. Note too that it is set up
differently from the table of areas for the normal curve. A tail
of the normal curve is found by subtracting the tabulated value from
0.5000, and doubling this value yields the level of significance.
Table 100. Values of t for Degrees of Freedom n and Levels
of Significance P

                         Level of Significance
  n       .9       .7       .5       .3       .1       .05      .01
  4      .134     .414     .741    1.190    2.132    2.776    4.604
  5      .132     .408     .727    1.156    2.015    2.571    4.032
  6      .131     .404     .718    1.134    1.943    2.447    3.707
  8      .130     .399     .711    1.108    1.860    2.306    3.355
 10      .129     .397     .706    1.093    1.812    2.228    3.169
 15      .128     .393     .691    1.074    1.753    2.131    2.947
 20      .127     .391     .687    1.064    1.725    2.086    2.845
 30      .127     .389     .683    1.055    1.697    2.042    2.750
  ∞      .1257    .3853    .6745   1.0364   1.6449   1.9600   2.5758

Table 100, however, shows n (degrees of freedom) in the stub, P
(the level of significance) in the caption, and t in the body of the
table. The last line of the table for n = ∞ shows values of t obtained
from the normal curve.
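Entries of a table of this kind can be checked numerically. The sketch below is one possible approach, not the method by which the table was computed: it assumes the standard density of Student's t distribution (which the text does not give), integrates it by Simpson's rule, and finds by bisection the value of t whose two-tailed probability equals a chosen level. The function names are hypothetical.

```python
import math


def t_density(x, n):
    """Density of Student's t with n degrees of freedom (standard formula, assumed here)."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)


def two_tail_probability(t, n, steps=2000):
    """P(|T| > t), using Simpson's rule for the area between 0 and t."""
    h = t / steps
    total = t_density(0.0, n) + t_density(t, n)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, n)
    area = total * h / 3
    return 1 - 2 * area


def critical_t(p, n, hi=50.0):
    """Value of t whose two-tailed probability is p, found by bisection."""
    lo = 0.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if two_tail_probability(mid, n) > p:
            lo = mid   # tail still too heavy: move outward
        else:
            hi = mid
    return (lo + hi) / 2


print(round(critical_t(0.01, 10), 3))  # compare with the n = 10, P = .01 entry
print(round(critical_t(0.05, 4), 3))   # compare with the n = 4, P = .05 entry
```

Because the level of significance refers to both tails, the routine works with P(|T| > t) rather than with a single tail.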

Exercise: Show that $\sigma_{u,est.}$ may be found from

$$\sigma_{u,est.} = \sqrt{\frac{N\sum X^2 - \left(\sum X\right)^2}{N(N-1)}}$$

Illustrative Example 1. A corporation has set as a standard the
mean breaking strength of a certain type of wire at 582 pounds.
A sample of 10 specimens was tested with the results shown in
Table 101. For the purposes for which the wire is used, values
within the 5 per cent level are tolerated. Does this sample meet the
requirements?

Table 101. Breaking Strength of Wire

Specimen    Breaking Strength      x      x²
                (pounds) X
    1              581             2       4
    2              576            -3       9
    3              584             5      25
    4              586             7      49
    5              575            -4      16
    6              573            -6      36
    7              574            -5      25
    8              572            -7      49
    9              588             9      81
   10              581             2       4
 Total            5790             0     298

Solution:

We have

$$M = \frac{5790}{10} = 579 \text{ pounds}.$$

$$\sigma = \sqrt{\frac{298}{10}} = 5.46 \text{ pounds}.$$

$$\sigma_M = \frac{5.46}{\sqrt{9}} = 1.82 \text{ pounds}.$$

In the t table for n = N - 1 = 9, we have at the 5 per cent level,
t = ±2.3. That is, a variation of ±2.3(1.82) or ±4.19 pounds on
either side of 582 pounds is tolerated. Hence the toleration limits
are (582 ± 4.19) pounds or from 577.81 pounds to 586.19 pounds.
Certainly 579 pounds, the mean of the sample, is well within these
limits.
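The arithmetic of this example can be retraced in a few lines. This is a sketch only; the variable names are illustrative, and the value t = 2.3 is taken from the text rather than computed.

```python
import math

# Breaking strengths (pounds) of the 10 specimens in Table 101.
strengths = [581, 576, 584, 586, 575, 573, 574, 572, 588, 581]
n = len(strengths)

mean = sum(strengths) / n
# Sample sigma with divisor N, as in formula (26).
sigma = math.sqrt(sum((x - mean) ** 2 for x in strengths) / n)
# Standard error of the mean for a small sample: sigma / sqrt(N - 1).
sigma_m = sigma / math.sqrt(n - 1)

standard = 582   # universe mean set by the corporation
t_5pct = 2.3     # 5 per cent value of t for n = 9, as read in the text
lower = standard - t_5pct * sigma_m
upper = standard + t_5pct * sigma_m
within = lower <= mean <= upper

print(mean, round(sigma, 2), round(sigma_m, 2))
print(round(lower, 2), round(upper, 2), within)
```

Working with unrounded intermediate values moves the limits only in the second decimal place and does not affect the conclusion.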

Illustrative  Example  2.  Table  102  gives  data  on  strength  tests 
(lbs.  per  sq.  in.)  on  two  types  of  wool  fabric.  Is  the  difference  in 
the  means  sufficient  to  warrant  the  conclusion  that  Type  2  is  superior 
to  Type  1? 

Table 102

        Type 1                      Type 2
Specimen    Strength        Specimen    Strength
    1         139               1         137
    2         127               2         132
    3         134               3         135
    4         125               4         144
    5         141               5         131
    6         144               6         133
    7         128               7         136
    8         138               8         134
    9         131               9         139
   10         133              10         129

For these data we find

N₁ = 10                            N₂ = 10
M₁ = 134 lbs. per sq. in.          M₂ = 135 lbs. per sq. in.
σ₁ = 6.05 lbs. per sq. in.         σ₂ = 4.09 lbs. per sq. in.

$$\sigma_{M_1} = \frac{\sigma_1}{\sqrt{N_1 - 1}} = 2.02 \text{ lbs. per sq. in.}$$

$$\sigma_{M_2} = \frac{\sigma_2}{\sqrt{N_2 - 1}} = 1.36 \text{ lbs. per sq. in.}$$

$$\sigma_D = \sqrt{(2.02)^2 + (1.36)^2} = 2.4 \text{ lbs. per sq. in.}$$

$$t = \frac{D}{\sigma_D} = \frac{135 - 134}{2.4} = .417$$

From Table 100, for n = N - 1 = 9 and t = .417 we estimate P
at about 0.7, indicating that a difference of 1 lb. per sq. in. might
occur 7 times in 10. There is thus no evidence to support a contention
that Type 2 is superior to Type 1.
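The same computation can be sketched in code from the raw strengths of Table 102. The helper name is hypothetical; because the sketch carries unrounded intermediate values, it gives t of about 0.41 rather than the .417 obtained from the rounded figures, which leaves the conclusion unchanged.

```python
import math

type1 = [139, 127, 134, 125, 141, 144, 128, 138, 131, 133]
type2 = [137, 132, 135, 144, 131, 133, 136, 134, 139, 129]


def mean_and_standard_error(xs):
    """Return (M, sigma_M) with sigma from divisor N and sigma_M = sigma / sqrt(N - 1)."""
    n = len(xs)
    m = sum(xs) / n
    sigma = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return m, sigma / math.sqrt(n - 1)


m1, se1 = mean_and_standard_error(type1)
m2, se2 = mean_and_standard_error(type2)

# Standard error of the difference of two independent means.
sigma_d = math.sqrt(se1 ** 2 + se2 ** 2)
t = (m2 - m1) / sigma_d

print(m1, m2, round(se1, 2), round(se2, 2), round(sigma_d, 2), round(t, 3))
```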


115.  CONCLUDING  REMARKS  ON  SAMPLING 

The  statistical  theory  of  sampling  is  a  fundamental  and  basic 
problem  in  mathematical  statistics.  It  has  challenged  and  continues 
to  challenge  some  of  our  best  minds.  The  reader  who  may  wish  to 
pursue  the  problem  further  will  find  the  following  articles  interesting 
and  not  too  difficult. 

H. C. Carver, Fundamentals of the Theory of Sampling, Annals of
Math. Statistics, Vol. I, page 101.

C. C. Craig, An Application of Thiele's Semi-invariants to the
Sampling Problem, Metron, Vol. VII, No. 4.

W. E. Deming and R. T. Birge, Statistical Theory of Errors, The
Graduate School of U.S. Dept. of Agriculture, Wash., D.C.

Dunham Jackson, The Theory of Small Samples, Amer. Math.
Monthly, June-July, 1935.

C. H. Richardson, The Statistics of Sampling, Edwards Brothers,
Ann Arbor, Michigan.

H. L. Rietz, Topics in Sampling Theory, Bulletin of the American
Mathematical Society, April, 1937.

W. A. Shewhart, Economic Control of Quality of Manufactured
Product, D. Van Nostrand Co., New York City.

116.  SUMMARY  OF  RELIABILITY  FORMULAS 

In this chapter we have undertaken only to introduce the reader to
what Karl Pearson has called the fundamental problem in statistics,
namely, the problem of sampling. To do more in an elementary
text would not be good judgment on our part. A list of the probable
errors that are needed most frequently follows and includes a few
which we are not in a position to derive here.¹

Statistical  Constant 
The  arithmetic  mean 

The  median  (normal  distribution) 

The  standard  deviation  (normal 
distribution) 

The  coefficient  of  correlation  (nor- 
mal distribution) 

« 

az  for  a  normal  distribution 


for  a  normal  distribution 


EXERCISES 

1. For the distribution of scores in English, (a) of Exercise 4, page 102,
we have found N = 334, M = 149.8, σ = 42.47. Find $E_M$ and interpret it.
Also find $\sigma_\sigma$ and interpret it.

2. For the distribution of the lengths of eggs, (a) of Exercise 15, page
105, we have found N = 450, M = 56.323 mm., σ = 2.386 mm. What is
the probability that the sample mean does not differ from the universe
mean by more than ±0.09 mm.? What is the probability that the
sample dispersion does not differ from the true dispersion of the universe
by more than ±0.07 mm.?

3. Find $\sigma_M$ and $\sigma_\sigma$ for the distribution of pulse beats, Table 29, page
165. Find the probability that the sample mean does not differ from the
universe mean by more than ±1.0 pulse beats per minute.

4. Assuming normality, find $\sigma_r$ and $E_r$ for the data of Table 59, and
interpret them.

5. Find $E_M$ for the data of the chest measurements of men, Exercise 10,
page 168, and interpret it.

¹ For the probable errors of other constants, see Rietz and others, op. cit., p. 77.


Probable Error

$$\frac{0.6745\,\sigma}{\sqrt{N}}$$

$$\frac{0.8454\,\sigma}{\sqrt{N}}$$

$$\frac{0.6745\,\sigma}{\sqrt{2N}} = \frac{0.4769\,\sigma}{\sqrt{N}}$$

1  -  r2 


1  -  r' 
0-6745-^ 


0.6745 
6. The following are summaries of the results on placement tests in
English which were given to two freshman classes entering Bucknell
University.

Group I              Group II
N = 334              N = 302
M = 149.79           M = 158.37
σ = 42.47            σ = 36.28

Is the difference between the means significant?

7. The heights of two groups of soldiers were measured and the
following results were secured:

Group I                  Group II
N = 10,000               N = 10,000
M = 67.51 inches         M = 62.24 inches
σ = 2.20 inches          σ = 2.25 inches

Is the difference in the means sufficient to warrant belief that the two
groups were chosen from different races?

8. We present below two frequency distributions based upon the batting
averages of players in the National and the American leagues during the
early part of the 1925 season. (See the accompanying table.)

Frequency Distribution of Batting Averages ¹

Batting Average    Number of Players in      Number of Players in
                   the National League       the American League
                   with the Given Average    with the Given Average
  .050-.099                 3                         0
  .100-.149                 7                        11
  .150-.199                11                        11
  .200-.249                21                        22
  .250-.299                31                        35
  .300-.349                34                        28
  .350-.399                18                        13
  .400-.449                 4                         6
  .450-.499                 0                         0
  .500-.549                 3                         2
  .550-.599                 0                         1

Is the difference in the means of these distributions significant?

¹ New York Herald Tribune, May 17, 1925. See also F. C. Mills and D. H.
Davenport, Manual of Problems and Tables of Statistics, 1925, p. 65.
9. In the theory of the chapter we have assumed that the parent
population consisted of the S variates $X_1, X_2, \ldots, X_S$. We proved that
$M_z = M_X$, where $M_z$ is the mean of the distribution of sample means and
$M_X$ is the mean of the parent population. Let us now transform the S
variates to this mean as origin and denote them by $x_i = X_i - M_X$, (i = 1,
2, ..., S).

Let $z_i$ be the ith sample mean of N variates chosen from the population
$x_1, x_2, \ldots, x_S$. We may have the ${}_S C_N$ distinct sample means which are
given by the following equations:

1  _ 

If-  -I 
22  =  +  X2  +  •  •  •  +         +  J 


N 

s 

Recalling  that  l^Xi  =  0: 

i  =a  I 

a.  Show  that: 

b.  Show  that: 

2^2 


S^^       1  VN  2N(N  -  1)  1 


which, upon applying the proper symmetric products, reduces to:

$$\frac{\sum z^2}{{}_S C_N} = \frac{S-N}{N(S-1)}\cdot\frac{\sum x_i^2}{S} = \frac{S-N}{N(S-1)}\,\sigma_x^2$$

c. Use a. and b. and show that:

$$\sigma_z^2 = \frac{S-N}{N(S-1)}\,\sigma_x^2$$

d. Show that:


which, upon applying the proper symmetric products, reduces to

$$\frac{\sum z^3}{{}_S C_N} = \frac{(S-N)(S-2N)}{N^2(S-1)(S-2)}\cdot\frac{\sum x_i^3}{S}$$

and finally to:

$$\mu_{3:z} = \frac{(S-N)(S-2N)}{N^2(S-1)(S-2)}\,\mu_{3:x}$$

10. Distributions of the heights of men born in England and in Scotland
gave the following results:

England                    Scotland
N = 6,194                  N = 1,304
M = 67.4375 inches         M = 68.5456 inches
σ = 2.548 inches           σ = 2.480 inches

Is the difference in the means sufficient to conclude that Scots are really
taller than Englishmen?

11.  A  distribution  of  150  people  in  normal  condition  gave  an  average 
pulse  rate  of  79.68  ±  0.15  beats  per  minute  but  after  being  administered 
a  certain  drug  they  showed  an  average  pulse  rate  of  81.12  ±  0.20  beats 
per  minute.  Is  it  probable  that  the  increase  in  the  pulse  rate  was  due  to 
the  drug,  or  is  the  increase  simply  a  result  of  variation  due  to  sampling? 

12. For the distributions of wages received by clothing workers in
Cincinnati, Cleveland, and St. Louis we have found the values given in
the table. Are the differences of the means significant? [See page 75.]

         Cincinnati    Cleveland    St. Louis
M          $16.77        $21.48       $15.90
σ            6.86          6.28         6.04

13. The average grades of sorority and non-sorority women on a certain
campus were as follows:

Sorority Group         Non-sorority Group
N = 175                N = 150
M = 81.23              M = 79.62
σ = 10.18              σ = 9.37

Is the difference of the arithmetic means sufficient to conclude that
there was a real difference in the scholarship of the two groups?

14.  Desiring  to  test  the  milk-producing  qualities  of  two  different  kinds 
of  food,  a  dairy  association  separated,  by  a  random  selection,  800  cows 
into  two  different  herds.  All  other  conditions  were  kept  identical  as  far 
as  possible.  Observing  the  cows  for  a  certain  period,  the  following  results 
were  obtained: 
Herd Number 1                  Herd Number 2
N₁ = 400                       N₂ = 400
M₁ = 36 quarts per cow         M₂ = 40 quarts per cow
σ₁ = 5.4 quarts per cow        σ₂ = 4.5 quarts per cow

Determine whether the difference between the average yields of the two
herds is or is not significant.

15.  The  following  table  exhibits  two  frequency  distributions  relating 
to  the  earnings  of  coal  miners  in  two  different  sections  of  Illinois.  Is  the 
difference  between  their  means  sufficient  to  conclude  that  these  two  sec- 
tions do  not  belong  to  the  same  homogeneous  group? 

Pick Miners in Illinois Coal Mines Classified According
to Average Daily Earnings, 1918-1921 ¹

Range of Average           Number of Pay Checks
Daily Earnings       In 21 Central      In 52 Southern
                     Illinois Mines     Illinois Mines
 $ 2.00- 2.99              501                 87
   3.00- 3.99            1,288                131
   4.00- 4.99            3,222                306
   5.00- 5.99            6,293                563
   6.00- 6.99            9,821                973
   7.00- 7.99           13,089              1,530
   8.00- 8.99           11,869              2,684
   9.00- 9.99            9,484              5,584
  10.00-10.99            6,748              2,426
  11.00-11.99            4,418              1,433
  12.00-12.99            2,551                853
  13.00-13.99            1,304                577
  14.00-14.99              696                364
  15.00-15.99              362                197
  16.00-16.99              196                105
  17.00-17.99              115                 71
  18.00-18.99               57                 35
  19.00-19.99               39                 33
  20.00-20.99               25                 13
  21.00-21.99               16                  6
  22.00-22.99               13                  7
  23.00-23.99               10                  4
  24.00-24.99               10                  4
  Total                 72,127             17,986

¹ See Mills and Davenport, op. cit., p. 107.
16. Two types of electric bulbs were observed as to the length of life,
and the following data were secured:

Type 1                   Type 2
N₁ = 100                 N₂ = 100
M₁ = 1224 hours          M₂ = 1036 hours
σ₁ = 36 hours            σ₂ = 40 hours

Is the difference in the two means sufficient to warrant the conclusion
that Type 1 is a bulb superior to Type 2?

17. A large number of men were measured as to height giving $M_u$
= 68.1 inches and $\sigma_u$ = 2.5 inches. How large a sample should be taken
in order to be fairly sure (probability 0.95) that the sample mean may
not differ from the true mean by more than ±0.5 inch?

18. The weights of 400 male babies of the same nationality were analyzed.
The analysis yielded M = 7.29 pounds and σ = 1.01 pounds. What
statements can you make about the universe mean weight of babies of
this nationality? If the universe mean were known to be 7.5 pounds,
would you consider the above described sample a random one?

19. (Treloar, p. 143) "Data secured from the archives of the Sloane
Hospital, New York City, for length of new-born infants of Irish parents
yielded the following statistics

Male (X)               Female (F)
N = 1,136              N = 1,071
M = 51.96 cm.          M = 51.22 cm.
σ = 2.181 cm.          σ = 2.189 cm.

Do these results justify the inference that Irish male offspring are in
general longer than females at birth? Do the results justify the inference
that male babies are generally less variable in length than females at
birth?"

20. The cost of building an identical house in various parts of the
United States in 1940 gave

M = $6,029      σ = $459        N = number of cities = 77.

The cost of building the same house during the first quarter of 1941 gave

M = $6,232      σ = $504        N = number of cities = 68.

Is this increase in average cost significant?

21. The British Cotton Industry Research Association tested the
breaking load on two types of yarn with the following results:

Type I                 Type II
N = 1,782              N = 1,914
M = 6.83 oz.           M = 7.48 oz.
σ = 1.23 oz.           σ = 1.33 oz.

Is the difference in the mean breaking load significant?

22. Karl Pearson and Alice Lee (Biometrika, Vol. II, p. 415) secured
the measurements of the stature of 1078 fathers and sons. The analysis
yielded the results:

Fathers                  Sons
N = 1,078                N = 1,078
M = 67.70 inches         M = 68.66 inches
σ = 2.72 inches          σ = 2.75 inches

r = 0.51

Determine if the difference in the means is significant.

23. The following exercise is based upon data given in the Proceedings
of the American Society for Testing Materials, 1930, Vol. 30, Part II,
pp. 448-455. A Committee of the Society, appointed to study corrosion,
made numerous studies of the length of life of steel plates immersed in
city water. The Committee found that the length of life was distributed
normally. Numerous tests on No. 16 gauge sheets immersed in Washington
tap water gave: $M_u$ = 1940 days and $\sigma_u$ = 224 days.

a. What is the probability that the mean of a sample of 100 sheets will
not differ more than 25 days from $M_u$?

b. Find the 5 per cent level of significance for the mean of a sample of
100 sheets.

c. What should be the size of sample in future tests in order that the
probability will not be greater than of the sample mean being in error
by more than 74 days?

24. The following item appeared in the New York Times November 22,
1942. "TALL FRESHMEN - From Yale comes the news that the class
of 1945 is the youngest and tallest that ever entered the university.
Average freshman age is 18 years, 1 month and 11 days. Average height
5 feet 8.5 inches. Compared with his predecessor of World War I the Yale
freshman of today is ten pounds heavier and 1.7 inches taller. Of all this
year's Yale freshmen 21.6 per cent (227 in actual numbers) are over
six feet tall."

Assuming N = 1,000, $\sigma_{weight}$ = 17 pounds and $\sigma_{height}$ = 2.5 inches, would
you say the above item was noteworthy?

25. The ages of 5,317 husbands and wives were secured and the analysis
of the data yielded the results:

Husbands               Wives
N = 5,317              N = 5,317
M = 42.8 years         M = 40.6 years
σ = 13.1 years         σ = 12.7 years

r = 0.91

Basing your judgment on these data would you state that the difference
in the means is significant?
L. H. C. Tippett, The Methods of Statistics, 3rd edition, Williams and
Norgate, London, 1941. This book is mainly one of interpreta-
ton Mifflin Company, 1924. A useful reference book for one who
APPENDIX B
AREAS AND ORDINATES OF THE NORMAL CURVE

$$\phi(t) = \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}$$

The following table gives the values of the area under the curve
from the ordinate at t = 0 to the ordinate for the values of t given
in the column at the left. Values of the ordinate are also given.
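Individual entries of such a table can be reproduced with the error function. The following is a minimal sketch using only Python's standard library; the function names are illustrative, and the identity area = ½ erf(t/√2) is a standard result not stated in the text.

```python
import math


def ordinate(t):
    """Ordinate of the normal curve, phi(t) = exp(-t^2 / 2) / sqrt(2 pi)."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)


def area_from_zero(t):
    """Area under the normal curve between the ordinates at 0 and t."""
    return 0.5 * math.erf(t / math.sqrt(2))


for t in (0.0, 0.5, 1.0, 1.96):
    print(t, round(area_from_zero(t), 4), round(ordinate(t), 4))

# Two-tailed level of significance from a tabulated area, as described in the text:
# subtract the area from 0.5000 and double the result.
level = 2 * (0.5 - area_from_zero(1.96))
print(round(level, 3))
```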
  t   ∫₀ᵗφ(t)dt  φ(t)      t   ∫₀ᵗφ(t)dt  φ(t)      t   ∫₀ᵗφ(t)dt  φ(t)      t   ∫₀ᵗφ(t)dt  φ(t)

 .00    .0000   .3989     .40    .1554   .3683     .80    .2881   .2897    1.20    .3849   .1942
 .01    .0040   .3989     .41    .1591   .3668     .81    .2910   .2874    1.21    .3869   .1919
 .02    .0080   .3989     .42    .1628   .3653     .82    .2939   .2850    1.22    .3888   .1895
 .03    .0120   .3988     .43    .1664   .3637     .83    .2967   .2827    1.23    .3907   .1872
 .04    .0160   .3986     .44    .1700   .3621     .84    .2996   .2803    1.24    .3925   .1849
 .05    .0199   .3984     .45    .1736   .3605     .85    .3023   .2780    1.25    .3944   .1827
 .06    .0239   .3982     .46    .1772   .3589     .86    .3051   .2756    1.26    .3962   .1804
 .07    .0279   .3980     .47    .1808   .3572     .87    .3079   .2732    1.27    .3980   .1781
 .08    .0319   .3977     .48    .1844   .3555     .88    .3106   .2709    1.28    .3997   .1759
 .09    .0359   .3973     .49    .1879   .3538     .89    .3133   .2685    1.29    .4015   .1736
 .10    .0398   .3970     .50    .1915   .3521     .90    .3159   .2661    1.30    .4032   .1714
 .11    .0438   .3965     .51    .1950   .3503     .91    .3186   .2637    1.31    .4049   .1692
 .12    .0478   .3961     .52    .1985   .3485     .92    .3212   .2613    1.32    .4066   .1669
 .13    .0517   .3956     .53    .2019   .3467     .93    .3238   .2589    1.33    .4082   .1647
 .14    .0557   .3951     .54    .2054   .3448     .94    .3264   .2565    1.34    .4099   .1626
 .15    .0596   .3945     .55    .2088   .3429     .95    .3289   .2541    1.35    .4115   .1604
 .16    .0636   .3939     .56    .2123   .3411     .96    .3315   .2516    1.36    .4131   .1582
 .17    .0675   .3932     .57    .2157   .3391     .97    .3340   .2492    1.37    .4147   .1561
 .18    .0714   .3925     .58    .2190   .3372     .98    .3365   .2468    1.38    .4162   .1540
 .19    .0754   .3918     .59    .2224   .3352     .99    .3389   .2444    1.39    .4177   .1518
 .20    .0793   .3910     .60    .2258   .3332    1.00    .3413   .2420    1.40    .4192   .1497
 .21    .0832   .3902     .61    .2291   .3312    1.01    .3438   .2396    1.41    .4207   .1476
 .22    .0871   .3894     .62    .2324   .3292    1.02    .3461   .2371    1.42    .4222   .1456
 .23    .0910   .3885     .63    .2357   .3271    1.03    .3485   .2347    1.43    .4236   .1435
 .24    .0948   .3876     .64    .2389   .3251    1.04    .3508   .2323    1.44    .4251   .1415
 .25    .0987   .3867     .65    .2422   .3230    1.05    .3531   .2299    1.45    .4265   .1394
 .26    .1026   .3857     .66    .2454   .3209    1.06    .3554   .2275    1.46    .4279   .1374
 .27    .1064   .3847     .67    .2486   .3187    1.07    .3577   .2251    1.47    .4292   .1354
 .28    .1103   .3836     .68    .2518   .3166    1.08    .3599   .2227    1.48    .4306   .1334
 .29    .1141   .3825     .69    .2549   .3144    1.09    .3621   .2203    1.49    .4319   .1315
 .30    .1179   .3814     .70    .2580   .3123    1.10    .3643   .2179    1.50    .4332   .1295
 .31    .1217   .3802     .71    .2612   .3101    1.11    .3665   .2155    1.51    .4345   .1276
 .32    .1255   .3790     .72    .2642   .3079    1.12    .3686   .2131    1.52    .4357   .1257
 .33    .1293   .3778     .73    .2673   .3056    1.13    .3708   .2107    1.53    .4370   .1238
 .34    .1331   .3765     .74    .2704   .3034    1.14    .3729   .2083    1.54    .4382   .1219
 .35    .1368   .3752     .75    .2734   .3011    1.15    .3749   .2059    1.55    .4394   .1200
 .36    .1406   .3739     .76    .2764   .2989    1.16    .3770   .2036    1.56    .4406   .1182
 .37    .1443   .3726     .77    .2794   .2966    1.17    .3790   .2012    1.57    .4418   .1163
 .38    .1480   .3712     .78    .2823   .2943    1.18    .3810   .1989    1.58    .4430   .1145
 .39    .1517   .3697     .79    .2852   .2920    1.19    .3830   .1965    1.59    .4441   .1127
  t   ∫₀ᵗφ(t)dt  φ(t)      t   ∫₀ᵗφ(t)dt  φ(t)      t   ∫₀ᵗφ(t)dt  φ(t)      t   ∫₀ᵗφ(t)dt  φ(t)

1.60    .4452   .1109    2.00    .4773   .0540    2.40    .4918   .0224    2.80    .4974   .0079
1.61    .4463   .1092    2.01    .4778   .0529    2.41    .4920   .0219    2.81    .4975   .0077
1.62    .4474   .1074    2.02    .4783   .0519    2.42    .4922   .0213    2.82    .4976   .0075
1.63    .4485   .1057    2.03    .4788   .0508    2.43    .4925   .0208    2.83    .4977   .0073
1.64    .4495   .1040    2.04    .4793   .0498    2.44    .4927   .0203    2.84    .4977   .0071
1.65    .4505   .1023    2.05    .4798   .0488    2.45    .4929   .0198    2.85    .4978   .0069
1.66    .4515   .1006    2.06    .4803   .0478    2.46    .4931   .0194    2.86    .4979   .0067
1.67    .4525   .0989    2.07    .4808   .0468    2.47    .4932   .0189    2.87    .4980   .0065
1.68    .4535   .0973    2.08    .4812   .0459    2.48    .4934   .0184    2.88    .4980   .0063
1.69    .4545   .0957    2.09    .4817   .0449    2.49    .4936   .0180    2.89    .4981   .0061
1.70    .4554   .0941    2.10    .4821   .0440    2.50    .4938   .0175    2.90    .4981   .0060
1.71    .4564   .0925    2.11    .4826   .0431    2.51    .4940   .0171    2.91    .4982   .0058
1.72    .4573   .0909    2.12    .4830   .0422    2.52    .4941   .0167    2.92    .4983   .0056
1.73    .4582   .0893    2.13    .4834   .0413    2.53    .4943   .0163    2.93    .4983   .0055
1.74    .4591   .0878    2.14    .4838   .0404    2.54    .4945   .0159    2.94    .4984   .0053
1.75    .4599   .0863    2.15    .4842   .0396    2.55    .4946   .0155    2.95    .4984   .0051
1.76    .4608   .0848    2.16    .4846   .0387    2.56    .4948   .0151    2.96    .4985   .0050
1.77    .4616   .0833    2.17    .4850   .0379    2.57    .4949   .0147    2.97    .4985   .0049
1.78    .4625   .0818    2.18    .4854   .0371    2.58    .4951   .0143    2.98    .4986   .0047
1.79    .4633   .0804    2.19    .4857   .0363    2.59    .4952   .0139    2.99    .4986   .0046
1.80    .4641   .0790    2.20    .4861   .0355    2.60    .4953   .0136    3.00    .4987   .0044
1.81    .4649   .0775    2.21    .4865   .0347    2.61    .4955   .0132    3.01    .4987   .0043
1.82    .4656   .0761    2.22    .4868   .0339    2.62    .4956   .0129    3.02    .4987   .0042
1.83    .4664   .0748    2.23    .4871   .0332    2.63    .4957   .0126    3.03    .4988   .0041
1.84    .4671   .0734    2.24    .4875   .0325    2.64    .4959   .0122    3.04    .4988   .0039
1.85    .4678   .0721    2.25    .4878   .0317    2.65    .4960   .0119    3.05    .4989   .0038
1.86    .4686   .0707    2.26    .4881   .0310    2.66    .4961   .0116    3.06    .4989   .0037
1.87    .4693   .0694    2.27    .4884   .0303    2.67    .4962   .0113    3.07    .4989   .0036
1.88    .4700   .0681    2.28    .4887   .0297    2.68    .4963   .0110    3.08    .4990   .0035
1.89    .4706   .0669    2.29    .4890   .0290    2.69    .4964   .0107    3.09    .4990   .0034
1.90    .4713   .0656    2.30    .4893   .0283    2.70    .4965   .0104    3.10    .4990   .0033
1.91    .4719   .0644    2.31    .4896   .0277    2.71    .4966   .0101    3.11    .4991   .0032
1.92    .4726   .0632    2.32    .4898   .0271    2.72    .4967   .0099    3.12    .4991   .0031
1.93    .4732   .0620    2.33    .4901   .0264    2.73    .4968   .0096    3.13    .4991   .0030

1.94 

.4738 

.0608 

2.34 

.4904 

.0258 

2.74 

.4969 

.0094 

3.14 

.4992 

.0029 

1.95 

.4744 

.0596 

2.35 

.4906 

.0252 

2.75 

.4970 

.0091 

3.15 

.4992 

.0028 

1.96 

.4750 

.0584 

2.36 

.4909 

.0246 

2.76 

.4971 

.0089 

3.16 

.4992 

.0027 

1.97 

.4756 

.0573 

2.37 

.4911 

.0241 

2.77 

.4972 

.0086 

3.17 

.4992 

.0026 

1.98 

.4762 

.0562 

2.38 

.4913 

.0235 

2.78 

.4973 

.0084 

3.18 

.4993 

.0025 

1.99 

.4767 

.0551 

2.39 

.4916 

.0229 

2.79 

.4974 

.0081 

3.19 

.4993 

.0025 
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t 

6(t) 

t 

<l>it) 

/ 

6(t) 

t 

■  ■ '  

6(0 

3.20 

.4993 

,0024 

3.50 

Am 

.0009 

3,80 

Am 

,0003 

4.10 

.5000 

.0001 

3.21 

.4993 

,0023 

3.51 

.4998 

.0008 

3.81 

Am 

.0003 

4.11 

.5000 

.0001 

3.22 

.4994 

.0022 

3.52 

.4998 

.0008 

3.82 

Am 

.0003 

4.12 

.5000 

.0001 

3.23 

.4994 

.0022 

3.53 

.4998 

.0008 

3.83 

Am 

,0003 

4.13 

,5000 

.0001 

3.24 

.4994 

.0021 

3.54 

.4998 

.0008 

3.84 

Am 

,0003 

4.14 

,6000 

.0001 

3.25 

.4994 

.0020 

3.55 

.4998 

,0007 

3.85 

.4999 

,0002 

4,15 

,5000 

.0001 

3.26 

,4994 

.0020 

3.56 

.4998 

.0007 

3.86 

.4999 

.0002 

4,16 

.5000 

,0001 

3.27 

.4995 

.0019 

3.57 

.4998 

.0007 

3.87 

.5000 

.0002 

4,17 

.5000 

,0001 

3.28 

.4995 

,0018 

3.58 

.4998 

.0007 

3.88 

.5000 

.0002 

4,18 

.5000 

.0001 

3,29 

.4995 

.0018 

3.59 

.4998 

.0006 

3.89 

.5000 

.0002 

4.19 

.5000 

.0001 

3.30 

.4995 

.0017 

3.60 

.4998 

,0006 

3.90 

.5000 

.0002 

4.20 

.5000 

.0001 

3.31 

.4995 

.0017 

3.61 

.4999 

.0006 

3.91 

.5000 

.0002 

4.21 

.5000 

.0001 

3.32 

.4996 

.0016 

3.62 

.4999 

.0006 

3.92 

.5000 

.0002 

4.22 

.5000 

.0001 

3.33 

,4996 

.0016 

3.63 

.4999 

.0006 

3.93 

,5000 

.0002 

4.23 

.5000 

.0001 

3.34 

,4996 

,0015 

3.64 

.4999 

.0005 

3.94 

,5000 

.0002 

4.24 

.5000 

.0001 

3.35 

,4996 

,0015 

3.65 

.4999 

.0005 

3,95 

.5000 

.0002 

4,25 

.5000 

.0001 

3.36 

.4996 

.0014 

3.66 

.4999 

.0005 

3,96 

.5000 

.0002 

4.26 

.5000 

.0001 

3.37 

.4996 

.0014 

3.07 

.4999 

.0005 

3,97 

.5000 

.0002 

4.27 

.5000 

.0000 

3.38 

.4996 

,0013 

3.68 

.4999 

.0005 

3,98 

.5000 

.0001 

4.28 

.5000 

.0000 

3.39 

.4997 

,0013 

3,69 

.4999 

.0004 

3,99 

.5000 

.0001 

4.29 

.5000 

.0000 

3.40 

.4997 

.0012 

3,70 

.4999 

.0004 

4,00 

.5000 

.0001 

3.41 

.4997 

.0012 

3,71 

.4999 

.0004 

4.01 

.5000 

.0001 

3.42 

.4997 

.0012 

3,72 

.4999 

.0004 

4.02 

.5000 

.0001 

3.43 

.4997 

.0011 

3,73 

.4999 

.0004 

4.03 

.5000 

.0001 

3.44 

,4997 

.0011 

3,74 

.4999 

.0004 

4.04 

,5000 

.0001 

3.45 

,4997 

.0010 

3,75 

,4999 

.0004 

4.05 

.5000 

.0001 

3.46 

.4997 

.0010 

3.76 

.4999 

.0003 

4.06 

.5000 

.0001 

3.47 

.4997 

.0010 

3.77 

.4999 

.0003 

4.07 

,5000 

.0001 

3.48 

.4998 

.0009 

3.78 

,4999 

,0003 

4.08 

,5000 

.0001 

3.49 

.4998 

.0009 

3.79 

.4999 

,0003 

4.09 

,5000 

.0001 
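
In the table above, each value of t is followed by two entries, taken here to be the area under the normal curve between 0 and t and the ordinate φ(t). The following is a minimal sketch, under that assumption, of how such entries could be recomputed; the function names are illustrative and do not come from the text.

```python
import math


def normal_ordinate(t: float) -> float:
    """Height of the standard normal curve at t."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)


def normal_area(t: float) -> float:
    """Area under the standard normal curve between 0 and t."""
    return 0.5 * math.erf(t / math.sqrt(2.0))


# Print a few rows in the same order as the table: t, area, ordinate.
for t in (1.60, 1.96, 2.58, 3.00):
    print(f"{t:4.2f}  {normal_area(t):.4f}  {normal_ordinate(t):.4f}")
```

Rounded to four places, the printed rows can be compared directly with the tabulated ones; a disagreement in the last digit would more likely point to a misprint or a scanning error than to the formulas.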

APPENDIX  C 
TABLES  OF  LOGARITHMS  AND  ANTILOGARITHMS 


FOUR-PLACE  LOGARITHMS 


 N      0     1     2     3     4     5     6     7     8     9

10    0000  0043  0086  0128  0170  0212  0253  0294  0334  0374
11    0414  0453  0492  0531  0569  0607  0645  0682  0719  0755
12    0792  0828  0864  0899  0934  0969  1004  1038  1072  1106
13    1139  1173  1206  1239  1271  1303  1335  1367  1399  1430
14    1461  1492  1523  1553  1584  1614  1644  1673  1703  1732
15    1761  1790  1818  1847  1875  1903  1931  1959  1987  2014
16    2041  2068  2095  2122  2148  2175  2201  2227  2253  2279
17    2304  2330  2355  2380  2405  2430  2455  2480  2504  2529
18    2553  2577  2601  2625  2648  2672  2695  2718  2742  2765
19    2788  2810  2833  2856  2878  2900  2923  2945  2967  2989
20    3010  3032  3054  3075  3096  3118  3139  3160  3181  3201
21    3222  3243  3263  3284  3304  3324  3345  3365  3385  3404
22    3424  3444  3464  3483  3502  3522  3541  3560  3579  3598
23    3617  3636  3655  3674  3692  3711  3729  3747  3766  3784
24    3802  3820  3838  3856  3874  3892  3909  3927  3945  3962
25    3979  3997  4014  4031  4048  4065  4082  4099  4116  4133
26    4150  4166  4183  4200  4216  4232  4249  4265  4281  4298
27    4314  4330  4346  4362  4378  4393  4409  4425  4440  4456
28    4472  4487


CHAPTER 1


Page  8 


12     22     32     42  n2 ' 

2. 2^1 + 2^2 + 2^3 + 2^4 + ... + 2^n.

3.  (1  -  3)  +  (2  -  3)  +  (3  -  3)  +  (4  -  3)  +  •  •  •  +  (n  -  3). 


Page  10 

4.  (1)  1,911.    (2)  6,635. 
Pages  13-14 

1. a. Σx = 0, (Σx)^2 = 0, Σx^2 = 1,308,  = 36.16.

b. ΣU = 200, ΣU^2 = 5,486, ΣX = 700, ΣX^2 = 50,486.

c. ΣX^2 = 220, ΣY^2 = 275, (ΣX)(ΣY) = 1,050, ΣXY = 176.

d. Σx = 0, Σy = 0, Σx^2 = 40, Σxy = -34, Σxy/Σx^2 = -0.85.

Pages  17-18 
1.  (1)  4.    (2)  3.    (3)  2.    (4)  3.    (5)  5. 

3.  0.00004.  4.  0.00004.  5.  2%.  6.  0.147%.  7.  0.04%. 
8.  5.165  X  10«;  about  0.01%.         10.  (1)  2,142.     (2)  2,774. 

11. (1) 178.55.  (2) 178.55.  12. (1) 310.53.  (2) 310.53.

Pages  21-22 

1. 363 ± 0.5.  2. 24,725 ± 87.5.  3.   ± 0.112.  4. 4.05 sq. ft.
7.  The  former.         11.  W  +  4n2  +  3n.         12.  J[4n8  +  33n2  ^  gg^]. 
13.  42,540.    15.  (1)  8,888.    (2)  123,464.    17.  (1)  154,198.    (2)  109,802. 
n(n+  l)(2n  +  4)  n(n  +  l)(2n  +  7) 


n  n 


9.  2a;(a:  -f  1).  10.  2(X»  -  My, 


6 


6 

n(n  +  l)(n  +  2)(3n  +  5) 
12 


21.  24,001,875. 


22. 


23. $5.13 per ton; 0.000192.

26.  $469,098,000;  0.178%. 

27.  $330.75. 

29.  20%.  30.  9%. 

32.  93.5%  of  value  in  1933. 


24.  $150,000,000;  3.8%. 

26.  105.9  bu.;  0.29%. 

28.  A:  39.37%;  B:  37.25%. 

31.  $187.50. 
33.  5.3  persons;  0.004. 


CHAPTER  3 


Pages  68-71 

1.  M  =  7.5. 

8.  Mx  =  27. 

11.  Mx  =  260. 

14.  Mx  =  448. 

Pages  74-75 

1.  M  =  67.42  inches. 


2.  M  =  1.956. 

9.  Mx  =  360. 

12.  Mx  =  537.3. 

16.  Mx  =  20.2. 


3. M = 53.7.

10.  Mx  =  237.25. 

13.  Mx  =  206.25. 

22.  Mx  =  53.7. 


2.  M  =  139.39  pounds.      3.  M  =  6.06  inches. 


5.  $31.87. 

6. 

$35.08. 

11.  22  cents. 

12.  $16.77;  $21.48;  $15.90. 

13. 

1,000  lbs.  per  sq.  in. 

Page  79 

1.  Md  =  1.98  cm. 

2. 

Md  =  $449.50. 

3.  a.  Md  = 

67.52  inches. 

4. 

$15.71;  $21.89;  $15.32. 

b.  Md  = 

138.12  pounds. 

5. 

Md = 991.3 lbs. per sq. in.

7.  22. 

8. 

Md  =  53.7  rays. 

Page  85 

        King     Parabola   Pearson   Unit

1.      1.998    1.997      2.02      cm.
2.      53.07    53.34      53.64     rays
3.      68.00    68.01      67.72     in.
4.      6.01     6.02       6.03      in.

Page  91 

a.  3.27%. 

b.  3.45%. 

c.  2.5%. 

Pages  91-92 

1.  Mg  =  17,043. 

2. 9.32%.

3.  20.5%;  $795;  $632.03; 

$502.46. 

4.  2.62%. 

Pages  97-98 

1.  26  mi.  per  day.  2.  66ij4  per  bu.       3.  15ff  per  gal. 
5.  a.  9f  units  per  hr.  6.  45  mi.  per  hr.

b.  6J  min.  7.  a.  8  problems  per  hr. 

c.  384.  b.  7.5  min.  per  problem. 


4.  24§  days.
Pages  101-10

2. 86%.  3. 5.8%.  4. a. M = 149.799.  b. M = 150.77.

5. a. Md = 151.09.  b. Md = 153.73.

6. a. Mo = 165.46.  b. Mo = 164.74.

7. a. 0.154.  b. Mg = 8,535.6.  8. M = 38.1 years.
9.  M  =  6.9  years.         =  4.1  years. 

10.  4.76%.  Estimated  values:  42,222;  53,278;  67,224;  170,428. 

11. a. 6.0759 pounds.  b. 16.46¢ per lb.

14. 11.12¢ per lb.; 8.99¢ per lb.

15. a. 56.32.  b. 41.92.  16. a. 56.25.  b. 41.95.
17. 166.6 years.  18. 542.9 millions.

19. Group A: Md = $7.54; Mx = $7.395.
Group B: Md = $7.54; Mx = $7.015.

20.  1.3%;  11,085,900.  21.  6(10)«. 

22.  21.4%;  $485.58.  23.  M  =  648.7;  M,  =  399.1. 

24.  8.56%.        25.  7.875%;  2,120,350  ;  4.93%.        26.  About  7.5  years. 

27.  4.99%.        28.  ^  units  per  minute.  29.  22  cents. 

31.  M  =  $1,371.72;  Ma  =  $1,365.37;  Mo  =  $1,432.83. 

32.  M  =  3.39%.  33.  108.9  millions  of  barrels. 

35.  Cincinnati:  $12.46;  Cleveland:  $22.33;  St.  Louis:  $15.72. 

36.  M  =  $1,280.01;  Md  =  $1,264.07;  Mo  =  $1,266.92. 

37.  10.6%.  38.  1,267.9  millions  of  dollars. 
39.  At  an  infinite  speed.                 42.  np, 

43. M = [2^(n+1) - 1]/(n + 1); Mg = 2^(n/2); Mh = (n + 1)2^n/[2^(n+1) - 1].


CHAPTER 4

Page 114

1. 27%.  2. 47%.  3. No.  4. 16%; 51%.

5. 7.5 to 8 inches.  6. 50 to 55 pounds.

7. M = Md = Mo = 35 pounds for A and B. No.


Page  119 

1.  7.2%. 

2. a. Q1 = 65.86 inches.        b. Q1 = 127.19 pounds.
      Q3 = 69.06 inches.           Q3 = 149.63 pounds.
      Q = 1.6 inches.              Q = 11.22 pounds.
      V = 2.4%.                    V = 8.1%.

4.      D1     D2     D3     D4     D5     D6     D7     D8     D9
   a.   90.8  114.4  128.0  139.4  151.1  162.3  171.4  186.8  203.8
   b.   92.2  114.5  129.0  140.6  153.7  163.0  171.2  187.3  205.3

6. Absolute: D6 - D4, D7 - D3, etc. Relative:  S  ^       S  etc.

7. 6.731, 5.859, 6.192 inches.  8. Q1 = 55.6, Q3 = 75.4.
9.  Cincinnati:  $12.17  to  $20.17. 

Cleveland:  $17.89  to  $25.48. 
St.  Louis:    $12.27  to  $18.57. 

10. Q1 = 5.94 inches, Q3 = 6.19 inches, Q = 0.125 inch. No.

11. Q1 = 52.2 rays, Q3 = 55.1 rays.

12.  $1,197.00  to  $1,506.72.  13.  No. 


Page  124 

1. a. 0, 86, 14.3.
b. 0, 56, 9.3.
c. 0, 156, 26.0.

2. Wheat: 1.08 bu. per acre.
Rye: 1.3 bu. per acre.
Oats: 2.03 bu. per acre.

3. M = 69.5; M.D. about M = 11.32; 58.1%.  4. No.


Pages  127-28 

1. Wheat: M = 14.34 bu. per acre; σ = 1.23 bu. per acre.
Rye: M = 12.15 bu. per acre; σ = 1.6 bu. per acre.
Oats: M = 30.27 bu. per acre; σ = 2.4 bu. per acre.

3. M = 20, σ = 6.45.


Page  133 

2. Cincinnati: σ = $6.86, M = $16.77.
Cleveland: σ = $6.28, M = $21.48.
St. Louis: σ = $6.04, M = $15.90.

3. M = $1,280.01, σ = $150.01.

4. M = $1,371.72, σ = $247.03, V = 18%.

5. A: M = 100.95, Md = 101.02, Mo = 101.1, σ = 13.
B: M = 47.71, Md = 47.68, Mo = 48.08, σ = 5.88.


Page  139 
4. a.

        1st     2nd     3rd     4th     5th     6th     7th     8th     9th     10th
        100     100     100     100     100     100     100     100     100     100    Total
M     142.35  138.75  138.65  139.05  138.35  139.35  137.05  138.75  140.65  138.55  139.15
σ      22.8    20.1    19.1    18.2    14.9    17.2    16.2    16.8    17.7    15.4    18.03

b. σ_means = 1.36 pounds, M_means = 139.15 pounds.
c. Seven.  d. They are equal.

Pages 147-49

1. σ = 0.20 inch, M.D. = 0.159 inch.

2. σ = 0.19 cm., 49.3, Yes, M = 1.956 cm.

3. M = 5, σ = 1.58, 971.

4. a. M = 67.42 inches, σ = 2.43 inches, V = 0.036.
b. M = 139.39 pounds, σ = 17.2 pounds, V = 0.12.

5. a. M = 149.8, σ = 42.5, V = 0.28.
b. M = 150.8, σ = 42.2, V = 0.28.

6. a. M = 56.323 mm., σ = 2.404 mm., V = 0.043.
b. M = 41.917 mm., σ = 1.385 mm., V = 0.033.

7.  b.  iV3CP^).  9.  17.

10. Q = 5.3, Q1 = 66.7, Q3 = 77.3, Mo = 72, Ex = 5.3, E_M = 0.17, E_σ = 0.12.

11. a. E_M = 0.04 inch, E_σ = 0.03 inch.
b. E_M = 0.3 pound, E_σ = 0.2 pound.

12. E_M = 0.004 inch, E_σ = 0.003 inch.

13.  V^^  =  0.22,  F^,  =  0.26.

14. Scores: E_M = 0.37.  Production: E_M = 0.97.

15. (1) 200.  (2) 285, $90 and $30.

16. The first distribution.  17. The distribution of weights.
27. M = np, σ = √(npq).  28.  X  =         =  3/^.


Page  154 


          A        B        C        D

M        20      26.7     13.3     9.75
Md       20      28.96    11.04    7.5
σ        7.2     7.25     7.25     6.42
Sk       0      -0.93     0.93     1.05

CHAPTER  5 

Pages  167-68 

1.  0.0083.  2.  a.  -  0.125.  b.  0.220.  3.  a.  -  0.047.  b.  0.026. 
4. The unadjusted values are:

          a         b

M       149.8    150.77
σ        42.5     42.2
α3      -0.05    -0.08
α4       2.65     2.58

6.

          Unadjusted Values       Adjusted Values
            a          b            a          b

M        56.322     41.917       56.322     41.917
σ         2.404      1.385        2.386      1.378
α3        0.603      0.128        0.616      0.130
α4        4.334      2.8904       4.373      2.8832

7. M = 6.062 inches, σ = 0.20 inch.

10.

        Unadjusted   Adjusted

M         39.835      39.835
Md        39.831
Mo        39.802
σ          2.052       2.032
α3         0.0287      0.0296

11.

        Unadjusted   Adjusted

M        171.8917    171.8917
Md       171.8858
Mo       173.407
σ          6.8236      6.7992
α3         0.1125      0.1137
α4         3.194       3.197

CHAPTER  6 

Pages  177-78 


1.

Year   Relatives to 1909     Year   Relatives to 1909

1909         100             1916          90
1910          90             1917          80
1911          83             1918          72
1912          88             1919          78
1913          86             1920          76
1914          84             1921          61
1915          83             1922          71

Year   Relatives to 1909   Link Relatives     Year   Relatives to 1909   Link Relatives

1909         100               ...            1914         138                106
1910         106               106            1915         151                109
1911         111               105            1916         160                106
1912         120               108            1917         175                110
1913         130               108            1918         190                109

Year   Rel.     Year   Rel.

1910    67      1920    87
1911    69      1921    87
1912    72      1922   101
1913    80      1923   120
1914    77      1924   130
1915    75      1925   141
1916    80      1926   144
1917    81      1927   151
1918    62      1928   154
1919    71      1929   150

4.

Year   Rel. (Corn)   Rel. (Hogs)   Average Price Rel.

1920       96           138              117
1921       60            84               72
1922       94            91               93
1923      103            75               89
1924      140            80              110
1925       96           117              107
1926       91           122              107
1927      103            99              101
1928      107            91               99
1929      111           100              106

Pages  183-84 
1.

Year                          1921   1923    1925   1927   1929
Aggregative Relative           100   104.5    103   99.6   96.9

2.

Year                          1921   1923    1925   1927   1929
Harmonic Mean of Relatives     100    119     134    127    128

3.

Year                       1921   1923   1926   1927   1929

Aggregative Relative        100    117    138    123    124
Arithmetic Mean of Rel.     100    123    137    132    132
Median of Relatives         100    116    140    122    125
Geometric Mean of Rel.      100    121    136    130    130
Harmonic Mean of Rel.       100    119    134    127    128

4.

Year            1921   1923   1926   1927   1929

Arith. Mean      100    105    102    115    108
Median           100    100     95    110     99
Geo. Mean        100    105    100    113    102

Page  187 
1. 


Year 

1915 

1918 

1920 

1922 

a.  Simple  Agg. 
Rel. 

107 

179 

225 

370 

b.  Simple  A.M. 
of  Rel. 

103 

168 

231 

144 

c.  Weighted  Agg. 
Rel. 

•  •  • 

217 

140 

Page  196 
1. 146.  2. 151.

3.
          1921    1929

(1)       96.7   119.5
(2)       91.4   117.3
(3)       94.0   118.4
(4)       97.4   138.9
(5)       91.3   117.8

Page 200

6. 150.6.  7. 94.0; 118.3.

8.
          1921    1923

(1)       89.7    94.7
(2)       91.4    90.4
(3)       90.5    92.5
(4)       79.7   124.1
(5)       90.5    92.2

9. 147.


CHAPTER  7 

Page  205 

1.  2.5.  3.  -  4.5.  5.  0.  9.  3. 

2.  1.  4.  -  V^.  6.  0.  10.  -  2, 


Pages  209-10 

1. y = 3X + 2.  2. y = 3X - 2.

3. y = -3X + 2.  4. y = -3X - 2.

5. a. Slope-intercept form, m = 3, b = -4.

6. a. y = 5X - 7.  b. 2y = 3X + 10.  c. y = -X + 8.

8. 7X - 5y + 4 = 0,  -  ^,  t.

13. y = 3X - 1.  14. 4X + 3y = 17.

15. Yes.  16. No.


Page  216 

y  =  2.975X  -  2.025. 


Pages  220-21 

1.  J?  =  0.02799^  +  10.122.  2.  I  =  0.02w  +  90.22.


3.

  w       I          w       I

  50    91.22       350    97.22
 100    92.22       400    98.22
 150    93.22       450    99.22
 200    94.22       500   100.22
 250    95.22       550   101.22
 300    96.22       600   102.22


ANSWERS  TO  EXERCISES 


Pages  225-26 

1. a. y = 0.5X + 1.
b. y = 2.6X - 2.
c. y = -0.85X + 12.1.
d. y = -1.5X + 50.

2.  S  =  0.52r  +  54.2,  80.2.

3. y = 0.765X + 22.9, 68.8.

4.  a.  If  =  1.02L  -  3.123.

5. L = -0.675T + 603.5, 552.875.

  T    Observed L   Computed L

 70       556         556.25
 80       550         549.5
 90       542         542.75
100       536         536.00
110       530         529.25
120       523         522.5
130       515         515.75

Pages  229-30 

1.  y  =  5.75X  +  111.475  with  X  =  0  at  1919. 

2.  (1)  y  =  2.65X  +  18.304.    (2)  $28.90. 

3.  (1)  y  =  3.483X  +  26.44.    (2)  $43.86  millions. 

4.  y  =  -  0.12X  +  5.2. 


CHAPTER  8 

Pages  236-37 

3.  (1)  y  =  -  0.46X  +  5.53. 

(3)  Sy  =  0.53  thousand  strikes  and  lockouts. 

(4)  1.85  thousand  strikes  and  lockouts. 

(5)  Computed  Y  =  0.65. 

4.  (1)  y  =  9.82X  -  29.7. 

(3)  Sy  =  $17.6. 

(4)  Computed  Y  =  $68.5. 
6. (1) y = 11.1X - 217.83.

(3)  Sy  =  $45.2. 

(4)  Computed  Y  =  $226.2. 


Page  243 

1. r = 0.95, y = 1.02X - 12.62, Sy = 3.93 c.u.

2. r = 0.89, y = 0.075X + 4.72, Sy = 0.58 ton.

3. r = 0.61, y = 3.32X + 21.93, Sy = 3.3 bu. per acre.

Pages 250-52

1. r = -0.92.  2. r = 0.61.

5. r = 0.95.  6. r = -0.84. No.

10. σx = σy = 3.42.  r = m = 1.  y = X + 4.

11.  b.    (Tx  = 

r  = 
m  = 

- 12.  y  = 

12.  b.  (Tx  = 

r  = 
m  — 
Y  = 


1.414. 

(Ty 

2.828. 

r 

-  1. 

m 

-  2. 

Y 

-  2X 

4.87. 

(Ty 

3.78. 

r 

0. 

m 

0. 

Y 

4. 

3.  a.  r 

b.  r 

1.414. 
4.242. 

-  1. 

-  3. 

-  3X  +  13. 
3.78. 

2.27. 
0. 
0. 
3. 


-  0.62. 

-  0.55. 


Pages  260-262 

1. Mx = 53.77¢.  My = $7.82.  r = 0.72.
σx = 25.2¢.  σy = $4.34.  y = 0.124X + 1.15.
If X = 75, Y_est = $10.45.  Sy = $3.04.

2. r = .40.  Y = 1.36X + 108.5.
X = 0.12Y + 13.2.

3. Mx = 16.25 min.  My = 82.125%.  r = -0.92.
σx = 5.04 min.  σy = 9.25%.  y = -1.68X + 109.4.
If X = 20, Y_est = 76%.  Sy = 3.63%.


Page  266 

1.  0.92. 

2.  0.85. 

Pages  270-76 

1.  (1)  y  =  0.075X  +  4.72. 

(2)  r  =  0.89.  Yes. 

(3) If X = 40, Y_est = 7.72 tons.

(4)  Sy  =  0.58  ton. 


Pi  II  =  0.64. 
Pi  III  =  0.62. 
Pii  III  =  0.78. 


2. r = 0.92.

3. r = 0.63.  y = 1.21X + 4.73.  X = 0.32y + 14.3.

4. r = -0.84. Yes.

7. r = 0.60.  y = 0.85X + 85.03.  X = 0.43y - 19.3.

10. (1) r = 0.829.  y = 0.65X + 2.75.  (2) Y_est = 6%.  (3) My = 6.14%.
(4) X = 1.056y - 1.26.  (5) X_est = 6.13%.  (6) Mx = 6.29%.
(7) Sy = 0.39%, Sx = 0.50%.

11. r = 0.80.

12. (2) r = -0.71.  (3) m = -0.32.  (4) Y = -0.32X + 187.69.
(5) 155.7¢, 123.7¢, 91.7¢.  (6) Sy = 33.6¢.

13. r = 0.73.  14. r = 0.90. Yes, spurious.

15. (2) If X = $175, Y = $138.88.  (3) $0.746.

16. The Bucknell test.


CHAPTER  9 


Page  282 

2. (1) X1 = 0.384X2 + 1.646X3 + 1.438.

Pages 284-285

2. (2) X1 = 0.839X2 + 0.462X3 - 0.270.
(4) R1(23) = 0.83, S1(23) = 1.47.

3. (2) R1(23) = 0.96, S1(23) = 5.02.

(3) X1 = 0.258X2 + 0.606X3 + 14.2.

4. (1) X1 = 0.575X2 + 1.092X3 + 15.982.
(2) 158.6.  (3) 35.9.

Page 288

6. (2) R1(23) = 0.96, S1(23) = 2.36.

Page 290

1. (-1, 3).  2. (6, 0).  3. (24, -16).  4.  (f,  ~  f).

Page 292

2. (-2, 1, 3).  3. (-1, 2, -3).

Page 293

1. (1) -15.  (2) -56.

Page 297

5. 72.88%.

Pages 301-303

1. a. Weight = 0.994 Length + 2.660 Breadth - 112.217.

b. 55.25 grams.

c. Weight = 0.046 Length + 1.056 Bulk - 2.081.

d. Weight = 1.098 Bulk - 0.098 Breadth + 2.416.

e. S_W(LBr) = 0.924, S_W(LBu) = 0.907, S_W(BuBr) = 0.909.

2. a. X1 = 0.55X2 + 1.07X3 + 0.083X4 - 69.

c. R1(234) = 0.826.

d. r12.34 = 0.764, r13.24 = 0.676, r14.23 = 0.09.


CHAPTER 10

Pages  310-311 

5. a = 1, b = -2, c = 2. If X = 5, Y = 17.
6. y = X^3 - 2X. If X = 5, Y = 115.

Page  313 

R  =  0.02792^  +  10.1368. 

Page  316 

3.  With  X  =  0  at  1924,  Y  =  3.483X  +  26.44. 
At 1929, X = 5, Y = 43.86 millions of dollars.

Page  323 

2. Using L.S., p = 30(0.99996)^h.

   h     1,000   2,000   5,000
   p      28.9    27.9    24.9

3.  L.S.  gives  T  =  17.8921  (0.9865)  ^ 

4.  With  ^  =  0  at  1920,  L.S.  gives  X  =  100,006(1.127)'. 

5.  L.S.  gives  H  =  0.86(1.39)^. 

Pages  329-330 

2. The points (8, 23) and (20, 360) give Y = 0.045X^3.  7.  T  -  49.5/V
4.  F  =  0.119X1  ^.  5.  y  =  2.26^0-^ 

Page  344 

1. a. log y = (log 3)X + log 1, y = 3^X.

b. log y = 0.2X + log 1, y = 10^(0.2X).

c. log y = 0.1X + log 1, y = 10^(0.1X).

d.  log  y  =  (log  5i)X  +  log  2,  y  =  2(5^).

e. log y = -0.1X + 1, y = 10(10^(-0.1X)).

f.  log  y  =  (log  2i)X  +  log  2-t,  y  =  2^. 
Pages  350-354 

2. y = 0.045X^3.  4. y = 0.04X^2.  6.  y  =  4(1.2)^.
7.      =  125(1.649)'.                     10.  Y  =  2.54(1.16)^^. 

16.  Choosing  X  =  0  at  1909,  Y  =  0.305X  +  7.36. 

17.  Choosing  X  =  0  at  1907,  Y  =  14.375X  +  159.31. 
At  1915,  X  =  8,  y  =  274.31. 

At  1920,  X  =  13,  y  =  346.185. 

18.  Choosing  X  =  0  at  1909,   a.  y  =  74X  +  1,988.7. 
b.  y  =  109.8X  +  2,053.8. 

19.  Choosing  X  =  0  at  1915,  Y  =  1.35X  +  31.8. 




Pages  357-361 

1. With X = 0 at 1910, L.S. gives Y = 0.435X + 35.2.

2.  L.S.  gives  V  =  499.82p-i  o^. 

3.  With  X  =  0  at  1910,  L.S.  gives 

Y  =  0.0574X2  +  2.67X  +  94.66. 

4. With X = 0 at 1900, L.S. gives Y = 0.714(1.031)^X.
At 1915, X = 15, Y = 1.13.

At 1928, X = 28, Y = 1.67.
5.  Using  L.S.,  S  =  44.603(1.049)^.

S  =  0.00147256>2  -  0.4741?  +  49.548. 

6. L.S. gives V = 3.1944 + 0.4516D - 0.7792D^2.
If  D  =  0.9,  V  =  2.9697. 

7.  Using  first,  seventh,  and  ninth  points,  d  =  31.5  +  60(0.9038)'. 

8. With X = 0 at 1910, Y = 19(1.086)^X.

9.  7  =  4.480Z>>-66»i. 

14.  L.S.  gives  with  ^  =  0  at  1909.5,  X  =  393.3(1.0743)*. 

16.  Using  first,  sixth,  and  eleventh  points,  y  =  10.1344  +  1.7521(1.2404)^. 


CHAPTER  11 

Pages  365-366 

1. 4.  2. 8.  3. 36.  4. 288.

5.  504.  6.  2,730.  7.  3,024. 

Page  367 

1.  156.  2.  6,720. 

3.  a.  362,880.  b.  725,760.           c.  725,760.           d.  2,903,040. 

4.  720.           5.  10.           7.  30,240.          8.  34,650.           9.  2,520. 

Pages  369-370 

1.  45;  45;  4,950.  2.  50,063,860.  3.  5,880.       4.  45.       5.  63. 

6.  a.  126.    b.  84.  8.  302,400.  9.  878,948,939. 
12. 3,600.  13. a. 700.  b. 1,408.  15. n = 11, r = 2.
16.  n  =  6.  17.  n  =  10.  18.  n  =  7. 

Pages  371-372 

1  2.048  1.992 

*•  4.040)  4,040' 

2.  T2^,  rh,  etc.  M  =  3.43,   a  =  1.3. 

3.  0.0085.  4.  0.514. 

Pages  373-374 

1«  iy  if  2.  -g*^.  3.       iy  7. 

7.  i,  i,  i  8.  i,  i,  J.  9.  The  former. 
10.  Vj.                               11.  i.  12.
Pages  376-377

^»    1.00  1  •      b.    1,00  !•  2.    1  i,050»  3.    fjj.  4. 

5.  a.  0.06.     b.  0.56.     c.  0.38.  6.  ^.  7.  f. 

Pages  380-382 
1.  a.  xi^-    b.  xi"S;-    c.         etc.     2,  1,  7,  21,  etc. 

15(5^)     .    2,906  .    4,651  ^  56 

3-  a.  b.  4.  6. 

7.  a.  0.2646.    b.  0.3483.  8.  0.09. 

80     ,    51  276 
9.  a.  ~-    b.  10.  — .  11.  25. 

12. a. 10(.94)^3(.06)^2.    b.  (M)HM)\
c. 10(.94)^3(.06)^2 + 10(.94)^2(.06)^3 + 5(.94)(.06)^4 + (.06)^5.

13. a. (.95)^5.    b. 10(.95)^2(.05)^3.

c. 10(.95)^2(.05)^3 + 5(.95)(.05)^4 + (.05)^5.

14. a. (.95)^10.    b. 10(.95)^9(.05).    c. (.95)^10 + 10(.95)^9(.05).

10 

d.  :::,oa(.95)'•(.05)^«-^  15.  25C>o(.9)2o(.l)^ 

r  =  5 

10 

16.  a.  ioCfi(.97)H.03)^    b.  2ioCr(.97)'-(.03)^«--. 

100 

17.  ZiooCr(.95)^(.05)^oo-r^  19.  |, 

r  =  90 

108     ,    799         36  3  11. 

20.  a.  —  •    b.  —  •    c.  21.  a.  f.    b.  .Ig. 

22. 

^'  1,024'    b.  c.  rh-        23.  a.  5.    b.  ^V-    c.  xl?' 

24.  4.  26.  i  26.  i.  27.  7^. 


CHAPTER  12 

Page  390 


          a        b        c        d

Mo        5        1        4     2 and 3
M         5        1       3.6      2.4
σ        0.91     0.91     0.6      0.98
α3      -0.73     0.73     1.33     0.20

Page 396

1.
              A                              B

     X      Graduated f(x)          X      Graduated f(x)

    60.5         3.5               29.5         2.0
    70.5        28.3               33.5        17.6
    80.5        99.0               37.5        70.3
    90.5       198.0               41.5       164.1
   100.5       247.5               45.5       246.1
   110.5       198.0               49.5       246.1
   120.5        99.0               53.5       164.1
   130.5        28.3               57.5        70.3
   140.5         3.5               61.5        17.6
                                   65.5         2.0

   Total       905.1               Total    1,000.2

Page  404 

1. a. 0.9773.    d. 0.0227.    g. 0.0227.
   b. 0.9836.    e. 0.9918.    h. 0.9892.
   c. 0.9834.    f. 0.0084.    i. 0.0027.

2. a. 2.14.    b. 0.65.    c. 1.655.    d. 0.6553.


Pages  411-413 

1. t = 5. No.

2. a. t = 6.1. Yes.    b. t = 3.46. Yes. From point of view of chance,
this might happen, but very improbable.

3.  0.0108. 


4. 


Grade                      F    E    D    C-    C    C+    B-    B    B+    A-    A    A+
Number receiving grade     1    7    28   79   159   226   226  159   79    28    7    1

7. The values of t for the five parts are: -∞ to -0.8415, -0.8415 to
-0.2533, -0.2533 to +0.2533, 0.2533 to 0.8415, 0.8415 to +∞.

8.  929.0. 

9. a. Yes. t = 0.88.  b. Yes. t = 2.5.  c. No. t = 7.

 X    Binomial   Ordinates       X    Binomial   Ordinates

 0     .000       .000           9     .175       .176
 1     .000       .0005         10     .122       .121
 2     .002       .002          11     .067       .065
 3     .0085      .009          12     .028       .027
 4     .028       .027          13     .0085      .009
 5     .067       .065          14     .002       .002
 6     .122       .121          15     .000       .0005
 7     .175       .176          16     .000       .000
 8     .196       .199
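
The two columns of values above are consistent with the terms of a binomial with n = 16 and p = 1/2, set beside the ordinates of a normal curve having the same mean and standard deviation. That reading is an inference from the tabulated numbers, not something stated in this answer key, and the names below are illustrative. Under that assumption, a short sketch of the comparison:

```python
import math

N_TRIALS = 16   # assumed from the range of X in the table
P = 0.5         # assumed from the symmetry of the tabulated values


def binomial_term(x: int, n: int = N_TRIALS, p: float = P) -> float:
    """Probability of exactly x successes in n trials."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def normal_ordinate(x: float, n: int = N_TRIALS, p: float = P) -> float:
    """Ordinate at x of the normal curve with mean np and sigma sqrt(npq)."""
    mean = n * p
    sigma = math.sqrt(n * p * (1 - p))
    t = (x - mean) / sigma
    return math.exp(-t * t / 2) / (sigma * math.sqrt(2 * math.pi))


# One line per value of X: X, binomial term, normal ordinate.
for x in range(N_TRIALS + 1):
    print(f"{x:2d}  {binomial_term(x):.4f}  {normal_ordinate(x):.4f}")
```

The agreement is closest near the mean and weakest in the tails, which is the pattern the table shows.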

11.  0.9.  12.  a.  0.95.   b.  0.95. 

13.  a.  0.076.   b.  0.076.  14.  2.5  inches. 

15.  a.  720  men.    b.  0.000.    c.  0.12. 

16.  a.  50  units,    b.  36  and  64  units. 

17. a. 900.  b. Yes. σ = 28.6.

19. Q1 = 53.3, Q3 = 66.7, Q = 6.7, M.D. = 8, α3 = 0, α4 = 3,
87th percentile = 71.2.

20.  8.5  and  18.7.  21.  a.  0.24.   b.  0.02.   c.  0.06.    d.  0.007. 

Page  417 

1. Using M = 39.835, adjusted σ = 2.0322, 1/σ = 0.492078, and rounding
the values of t to two decimal places, we have


        Theoretical f(x)               Theoretical f(x)
 X     Ordinates    Areas       X     Ordinates    Areas

33          7          8       42       1108       1110
34         32         34       43        582        592
35        116        123       44        240        252
36        329        339       45         78         81
37        737        746       46         20         21
38       1309       1295       47          4          4
39       1805       1818       48          1          1
40       1957       1929
41       1669       1646

                               Total     9994       9999

2. M = 67.92.
σ = 2.42.
1/σ = 0.4132.
Values of t are rounded to two decimal places.

   X      Theoretical f(x)

  57.5         0.2
  58.5         0.6
  59.5         2.4
  60.5         7.6
  61.5        21.4
  62.5        47.0
  63.5        91.7
  64.5       154.1
  65.5       207.9
  66.5       242.4
  67.5       244.8
  68.5       199.2
  69.5       140.7
  70.5        85.6
  71.5        41.8
  72.5        18.0
  73.5         6.5
  74.5         2.0
  75.5         0.5
  76.5         0.2

  Total     1515.2

3. M = 6.06.
σ = 0.20.
1/σ = 5.00.

   X      Theoretical f(x)

  5.45         4.3
  5.55        14.8
  5.65        40.4
  5.75        86.3
  5.85       144.3
  5.95       188.9
  6.05       193.5
  6.15       155.3
  6.25        97.6
  6.35        47.9
  6.45        18.5
  6.55         5.5
  6.65         1.3
  6.75         0.3
  6.85         0.0

  Total      998.9

4. Y = [924(4) / (11.0668√(2π))] e^(-t^2/2) = (333.972)φ(t), where t = (X - 74.2)/11.0668.


CHAPTER  13 


Pages  438-439 

2.  5,625;  22,500. 

4.  0.0525  million  per  cu.  mm. 

6.  Yes.  12.  Yes. 


3.  0.034  pound. 

5.  a.  0.9255.    b.  0.0026. 

0.62.  13.  Yes. 
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Pages  449-450 
1.  a. 


1st 
100 

2nd 
100 

3rd 
lUO 

/4h 
100 

5th 
100 

6th 
100 

7th 
100 

8th 
100 

9th 
100 

10th 
100 

Total 

M 

117.15 

120.85 

117.35 

122.95 

119.25 

118.45 

118.05 

113.65 

119.45 

120.25 

118.74 

a 

13.8 

17.4 

17.6 

21.4 

15.5 

17.8 

18.0 

13.4 

17.5 

16.2 

17.2 

b.  Mean of means = 118.74.  M_u = 118.74.

c.  Mean of σ's = 16.9.  σ_u = 17.2.

d.  Eight.  The five per cent limits are 115.37 and 122.11.

e.  Seven.  The five per cent limits are 14.8 and 19.6.

f.  Sampling probably went awry on the 4th 100 and the 8th 100, for
both the mean and the standard deviation are outside the five per
cent levels.
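
The limits quoted in parts d and e can be checked numerically. The sketch below assumes the usual large-sample standard errors, σ/√N for the mean and σ/√(2N) for the standard deviation, and takes the five per cent level as 1.96 standard errors on either side of the universe value. The function and variable names are illustrative only and do not come from the text.

```python
import math

def five_per_cent_limits(center, std_error, z=1.96):
    """Return (lower, upper) limits lying z standard errors from center."""
    half_width = z * std_error
    return center - half_width, center + half_width

N = 100          # size of each sample
M_u = 118.74     # universe mean, from answer b
sigma_u = 17.2   # universe standard deviation, from answer c

sigma_M = sigma_u / math.sqrt(N)          # assumed standard error of the mean
sigma_sigma = sigma_u / math.sqrt(2 * N)  # assumed standard error of the standard deviation

mean_limits = five_per_cent_limits(M_u, sigma_M)
sd_limits = five_per_cent_limits(sigma_u, sigma_sigma)

# Sample means and standard deviations from the table in part a.
sample_means = [117.15, 120.85, 117.35, 122.95, 119.25,
                118.45, 118.05, 113.65, 119.45, 120.25]
sample_sds = [13.8, 17.4, 17.6, 21.4, 15.5, 17.8, 18.0, 13.4, 17.5, 16.2]

# Count how many samples fall inside each pair of limits.
means_inside = sum(mean_limits[0] <= m <= mean_limits[1] for m in sample_means)
sds_inside = sum(sd_limits[0] <= s <= sd_limits[1] for s in sample_sds)

print([round(v, 2) for v in mean_limits], means_inside)
print([round(v, 1) for v in sd_limits], sds_inside)
```

Under these assumptions the counts agree with the answers: eight means and seven standard deviations lie within their limits.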

2.  Difference not due to chance.  t = 13+.

3.  Yes.  t = 6+.

4.  Class of 1943 is poorly prepared.  t = 4.4.
Class of 1945 is within 5 per cent level.  t = 1.7.

5.  σ_(p₁−p₂) = 0.0078.  Difference not significant.  t = 0.5+.

6.  σ_(p₁−p₂) = 0.0417.  Difference not significant.  t = 1.1+.

Pages  457-464 

1.  E_M = 1.57.  Even chance that sample mean, 149.8, does not differ
from M_u by more than ±1.57.

σ_σ = 1.64.  A two to one chance that the sample σ, 42.47, does not
differ from σ_u by more than ±1.64.

2.  a.  About 0.58.    b.  About 0.62.

3.  a.  σ_M = 0.364, σ_σ = 0.257 p.b. per min.    b.  About 0.994.

4.  r = 0.77, σ_r = 0.05.  A two to one chance that the sample r does not
differ from the universe r by more than ±0.05.   E_r = 0.03.
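
Answers 3 and 4 pair standard errors with related quantities, and the figures can be cross-checked. The sketch below assumes the normal-curve relation that a probable error is 0.6745 times the corresponding standard error, and the large-sample relation that the standard error of σ equals the standard error of the mean divided by √2. The names are illustrative, not the book's notation.

```python
import math

# Half of a normal distribution lies within 0.6745 standard deviations of its mean.
PROBABLE_ERROR_FACTOR = 0.6745

def probable_error(std_error):
    """Probable error corresponding to a given standard error."""
    return PROBABLE_ERROR_FACTOR * std_error

# Answer 4: standard error of r is 0.05.
E_r = probable_error(0.05)

# Answer 3a: standard error of the mean is 0.364.
sigma_sigma = 0.364 / math.sqrt(2)

print(round(E_r, 2), round(sigma_sigma, 3))
```

Rounded as in the answers, these give 0.03 and 0.257.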

5.  E_M = 0.0137 inch.  An even chance that the sample mean, 39.835
inches, does not differ from the universe mean by more than 0.0137
inch.

6.  t = 2.7 and the difference is probably significant.

7.  t = 167, and the difference is certainly significant.  In fact, Group I
was American soldiers and Group II was Japanese soldiers.

8.  For National League, M = 0.283, σ = 0.086.
For American League, M = 0.278, σ = 0.085.
t = 0.47, and the difference is not significant.

10.  t = 14.6, and the difference in the means is sufficient to warrant the
conclusion that Scots are taller than Englishmen.
11.  Yes.  k = 6.8.

13.  t = 1.5.  14.  t = 11.7.

       1st Group    2nd Group
N       72,127       17,986
M       $8.37        $9.59
σ       $2.49        $2.43

t = 60.  Hence (M₂ − M₁) is significant.
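
The value t = 60 can be reproduced from the tabulated figures. The sketch below takes the third row of the table ($2.49 and $2.43) to be the two standard deviations, which is an assumption because its label did not survive, and uses the large-sample standard error of a difference between two independent means. The variable names are illustrative.

```python
import math

# Figures from the table above.
N1, N2 = 72127, 17986   # sample sizes
M1, M2 = 8.37, 9.59     # sample means, in dollars
s1, s2 = 2.49, 2.43     # assumed to be the standard deviations, in dollars

# Standard error of the difference between two independent means.
sigma_diff = math.sqrt(s1**2 / N1 + s2**2 / N2)

# Ratio of the observed difference to its standard error.
t = (M2 - M1) / sigma_diff

print(round(sigma_diff, 4), round(t))
```

A difference of about 60 standard errors lies far beyond any conventional level of significance, which is consistent with the stated conclusion.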

16.  t = 35.3.  17.  96.

18.  t  =  4.15.  19.  t  =  7.96;  t  =  0. 

20.  t  =  2.53.  Probably  significant. 

21.  Yes.   t  =  16+.  22.  Yes.  t  =  11.6. 

23.  a.  About  0.73.   b.  About  44  days.   c.  N  =  25. 

24.  Yes.   For  t  =  15+  for  heights  and  13+  for  weights. 

25.  Yes. 
The numbers refer to pages.

Absolute error, 16

Absolute value of a number, 121

Accuracy, in measurements, 14; meas-
urements of, 15

Aggregative relatives, simple, 179;
weighted, 185

Analysis, statistical, 2

Arithmetic mean, as a moment, 62;
calculation of, 60, 62, 73; criticism
of, 99; defined, 60; of relatives, 181-
182, 188-190; probable error of,
143, 432; standard deviation of, 144,
430; standard error of, 144

Array, nature of, 24

Asymmetry, defined, 43, 52; Pearson's
measures of, 151, 152; positive and
negative, 152-153; quartile measure
of, 156; third moment as measure
of, 157

Average, characteristics of a good, 59;
uses of an, 59; of relatives, 182-184,
188-200

Average deviation, 120

Base, in index number construction,
175

Benson, Paul, Preface

Bernoulli, James, 395

Bernoulli Theorem, 395

Bessel, Friedrich Wilhelm, 138, 452

Bias, downward, 197; in averages of
relatives, 197; in use of weights, 198;
type, 197; upward, 197; weight, 197

Binomial expansion, 367, 383-395

Binomial, point, arithmetic mean of,
388; general form of, 384; gradu-
ation of data by, 391-395; mode of,
386; skewness and excess of, 389-
390; standard deviation of, 388

Birge, R. T., 456

Bowley, A. L., 5, 156

Bradstreet's index number, 180

Brahe, Tycho, 339

Bravais, A., 237

Brinton, W. C., 466

Brown, W. and Thomson, G. H., 240

Bruce, C. W., 110

Burgess, Robert W., 98, 197


Camp,  B.  H.,  36,  465 
Carver, H. C., Preface, 456
Central  tendency,  52,  98-101;  meas- 
ures of,  59-110 
Chaddock,  R.  E.,  269,  465 
Charts,  construction  of,  37-53 
Class,  boundary,  26;   frequency,  25; 
interval, 25, 26, 30; limits, 26, 30;
mark,  27;  unit,  70,  71,  72;  width, 
25 

Classification  of  data,  23 

Coefficient  of,  correlation,  238,  245, 
246,  254;  multiple  correlation,  281, 
284,  295,  300,  305;  regression,  250, 
295;  skewness,  151,  152,  156,  157; 
variation,  131 

Column  diagram  (see  Histogram),  con- 
struction of,  37;  defined,  37 

Combination,  366 

Compound interest law, 316

Confidence  limits,  410 

Continuous  variate,  6 

Coolidge,  J.  L.,  379 

Correlation,  by  ranks,  263-265;  co- 
efficient of,  237,  245,  246,  254,  265; 
definition  of,  240;  index,  357;  mul- 
tiple, 277-305;  non-linear,  355; 
partial,  295;  perfect,  239,  287;  sum- 
mary of,  247;  table,  254;  versus 
causation,  267 

Craig,  A.  T.,  391 

Craig, C. C., 456

Crathorne,  A.  R.,  Preface 

Crowder,  W.  F.,  35,  201,  288 

Croxton  and  Cowden,  453,  465 

Curve  fitting,  by  averages,  313;  by 
least  squares,  307,  314,  331;  by 
moments,  307,  331;  by  selected 
points,  311;  of  exponential  function, 
316;  of  hyperbola,  333;  of  modified 
exponential,  334;  of  modified  power 
function,  337;  of  normal  curve,  413- 
417;  of  parabola,  330-333;  of  power 
function, 323; of straight line, 216,
222,  311,  314 

Curves,  cumulative,  50;  exponential, 
316; hyperbolic, 333; J-shaped, 52;
mound-shaped, 44, 52; normal, 63,
134, 363-409; parabolic, 82, 330;
skewed,  52,  152,  153,  383,  384 
Czuber,  Emanuel,  85 

Data,  statistical,  1;  grouped,  23;  un- 

grouped,  23 
Davenport,  D.  H.,  458 
Davies,  George  R.,  35,  201,  288 
Decile,  119 

Degrees  of  freedom,  453 

Deming,  W.  E.,  456 

De  Moivre,  Abraham,  363,  396,  397 

Dependent  events,  376 

Dependent  variable,  6 

Determinants,  defined,  289,  290,  292; 
finding  mode  by,  110;  in  multiple 
correlation,  293-305 

Deviation,  mean,  121;  probable,  119; 
quartile,  115;  standard,  of  a  differ- 
ence, 444;  standard,  of  a  distribu- 
tion, 125-130;  standard,  of  the 
mean,  144,  430;  standard,  of  the 
standard  deviation,  144,  443 

Differences,  standard  error  of,  444-445 

Differencing,  process  of,  307;  use  of, 
in  curve-fitting,  309-310 

Discrete  variate,  6,  15 

Dispersion,  43,  111;  meaning  of,  111- 
115;  measures  of,  113 

Distribution  of  means,  defined,  143- 
144;  excess  of,  441;  illustrated,  139, 
140,  426;  mean  of,  144,  429;  prob- 
able error  of,  144,  432;  standard 
deviation  of,  144,  430;  standard 
error  of,  144;  skewness  of,  439 

Distribution of standard deviations, de-
fined, 144, 442; mean of, 145; stand-
ard deviation of, 145, 442; standard
error of, 144

Distributions,  asymmetrical,  52;  cumu- 
lative frequency,  48;  J-shaped,  52; 
mound-shaped,  44,  52;  normal,  53, 
414-416;  simple  frequency,  25; 
symmetrical, 52; temporal, 44; U-
shaped, 52

Empirical curves, defined, 210, 306;
limitations of, 338; methods of fit-
ting, Chapter 10

Empirical  equation,  210,  306,  338 

Empirical probability, 369

Equation  of,  exponential  functions,  316; 
hyperbola,  333;  hyperplane,  298; 
modified  exponential,  334;  modified 
power  function,  337;  normal  curve, 
119,  396,  400;  plane,  278;  power 
function,  323;  quadratic  parabola, 
330; straight line, 210, 311


Error,  absolute,  16;  possible,  15;  prob- 
able, 137;  probable,  of  mean,  143, 
432;  probable,  of  standard  deviation, 
144,  443;  relative,  16;  standard,  of 
estimate,  233,  239,  280,  286,  295, 
300,  304 

Excess  (kurtosis),  43,  158,  389,  439 
Expectation,  374 

Exponential  function,  fitting  data  to, 
316,  343;  when  to  use  with  empirical 
data,  317-318 

Ezekiel,  Mordecai,  466 

Factor  reversal  test,  199 
Fisher's  Ideal  Index,  198 
Fisher,  Irving,  195,  198,  199,  200 
Fisher,  R.  A.,  364,  421,  447,  452,  453, 
465 

Forsyth,  C.  H.,  Preface 

Freeman,  H.  A.,  439 

Frequency,  class,  25,  422;  cumulative, 

48;  relative,  372 
Frequency  curves,  40,  397;  types  of, 

52,  418 

Frequency  distribution,  binomial,  382- 
395;  cumulative,  48;  normal,  395- 
410,  413-417;  simple,  25 

Frequency  polygon,  38 

Frequency  table,  4,  25 

Function,  7 

Gale,  A.  S.,  271 

Gauss,  Carl  Friedrich,  138,  363,  396 
Gavett,  G.  I.,  271 

Geometric  mean,  computation  of,  90; 
criticism  of,  101;  defined,  87;  of 
relatives,  180-182,  191-193;  use  of, 
88 

Glover, J. W., Preface, 379, 467

Goodness of fit, tests for, 211, 213, 230,
232,  233,  286,  300,  304,  416 

Graduation  of  a  frequency  distribution, 
by  point  binomial,  391-395;  by 
normal  curve,  413-417 

Graphical  representation,  37,  340-350; 
of cumulative distributions, 50; of
simple  frequency  distributions,  37, 
38,  39;  of  temporal  distributions, 
43-48;  with  logarithmic  paper,  346; 
with  semi-logarithmic  paper,  342 

Growth, law of organic, 316

Harmonic  mean,  computation  of,  93; 

defined,  92;  of  relatives,  181,  188; 

uses of, 93, 95-97, 181, 198
Hall, Winfield S., 171
Haskell, S. C., 46, 466
Histogram,  37 
Holzinger, Karl, 465
Hotelling,  Harold,  153 
Huntington,  E.  V.,  5 
Hyperbola,  333 

Independent,  events,  375;  variable,  6 
Index  numbers,  174-202;  as  average 

of  relatives,  180-182,  188-193;  bias 

in,    197-198;    defined,    174,  177; 

Fisher's  Ideal,  198;  purpose  of,  174; 

unweighted,    178-184;  weighted, 

185-200 

Index  of  precision,  defined,  399;  use 
of,  433 

Interpretation  of  statistical  results,  3, 

364,  410,  Chapter  13 
Interval,  class,  25,  26,  30 

Jackson,  Dunham,  220,  456 

Karsten,  K.  G.,  466 

Kendall,  M.  G.,  1,  59,  145,  465 

Kenney,  J.  F.,  465 

Kepler,  Johann,  339 

King, W. I., 49, 81, 200

Kurtosis,  43 

Kurtz,  Edwin  B.,  172 

Laplace,  Pierre  Simon,  363,  396 
Least squares, principle of, 211; fitting
a parabola by, 330; fitting a straight
line by, 214-222, 314
Lee,  Alice,  463 
Levels  of  significance,  410 
Linear  trends,  203 
Lines  of  regression,  218,  248-249 
Lipka,  Joseph,  466 
Logarithmic  paper,  346 

May,  Mark,  301 

Mean,  arithmetic,  62;  geometric,  87; 

harmonic,  92 
Mean  deviation,  computation  of,  122; 

defined,  121 
Median,  49,  76;  computation  of,  51, 

78,  79;  defined,  49,  76 
Mill,  J.  S.,  269 

Mills,  F.  C.,  34,  188,  357,  458,  466 
Mitchell,  Wesley  C.,  200 
Modal  class,  81 

Mode,  80;  approximate,  80,  81,  84,  85; 
criticism  of,  100;  crude,  80;  true,  80 

Modified  exponential  function,  fitting 
data  to,  333;  when  to  use  with  em- 
pirical data,  333 

Modified  power  function,  fitting  data 
to,  337;  when  to  use  with  empirical 
data,  337 

Moment,  arithmetic  mean  as,  62-66 


Moments,  adjusted,  of  a  distribution, 
163;  computation  of,  164;  method 
of,  in  curve-fitting,  160;  unadjusted, 
of  a  distribution,  159;  of  point  bi- 
nomial, 387-389;  of  normal  curve, 
405 

Multiple correlation, coefficient of, 281,

287;  defined,  277,  288 
Mutually  exclusive  events,  374 

Normal  curve,  defined  by  equation,  53, 
119,  396;  derivation  of  equation  to, 
397;  graduation  of  distribution  by, 
413-417;  history  of,  363,  396;  mo- 
ments of,  405;  properties  of,  401; 
uses  of,  134,  397,  405-409 

Normal  equations,  215,  217,  278,  283, 
298,  303 

Null  hypothesis,  447 

Numerical  value  of  a  number,  121 

Ogive,  48 

Organic  growth,  law  of,  316 
Organization  of  data,  1,  5 

Parent  population  (universe),  3,  23, 

303,  419,  451 
Parkes, A. S., and Drummond, J. C.,

450 

Pearl, Raymond, 105, 301, 466
Pearson,  K.  S.,  439 

Pearson,  Karl,  85,  151,  237,  391,  416, 
419,  456,  463,  467 

Percentiles,  119 

Permutation,  364 

Point  binomial,  383-395 

Power function, fitting data to, 323,
347;  when  to  use  with  empirical 
data,  324 

Precision,  index  of,  399,  433 

Preliminary  sheet,  25,  253 

Probability,  369;  a  priori,  372;  em- 
pirical, 369;  theorems  on,  374-378 

Probable  deviation,  119 

Probable  error,  118,  137,  403;  defined, 
137,  403;  of  any  measure,  137,  403; 
of  the  arithmetic  mean,  143,  432;  of 
the  standard  deviation,  146,  443 

Quadratic  parabola,  fitting  data  to,  330; 
in  finding  approximate  mode,  83; 
when  to  use  with  empirical  data,  330 

Quartile  deviation,  115 

Quartiles,  computation  of,  117,  120; 
defined,  115;  in  measuring  disper- 
sion, 118;  in  measuring  skewness, 
156 

Quetelet,  Adolphe,  363 
Range, 24, 113, 114

Regression,  coefficients  of,  250,  295; 
of  Y  on  X,  250;  of  X  on  Y,  250; 
multiple,  295;  plane,  278;  hyper- 
plane,  299 

Relatives,  defined,  174;  chain,  176; 
fixed base, 176; link, 176; simple
aggregative, 179; simple arithmetic
mean of, 181; simple geometric
mean of, 181; simple harmonic mean
of,  181;  weighted  aggregative,  185; 
weighted  arithmetic  mean  of,  188; 
weighted  geometric  mean  of,  191; 
weighted  harmonic  mean  of,  188 

Relative  error,  defined,  16;  in  a  prod- 
uct, 19;  in  a  quotient,  20 

Relative  frequency,  369,  424 

Relative variability, 113, 118, 122, 131

Reliability,  defined,  143,  421;  of  a 
difference,  444-445;  of  the  mean, 
143,  453;  of  the  standard  deviation, 

145,  453 

Repeated  trials,  theorem  of,  378 
Residual,  210 
Riebesell,  Paul,  42 

Rietz, H. L., 164, 416, 424, 456, 465, 466
Robinson,  George,  168 
Rounding  off  numbers,  16 
Running,  T.  R.,  310,  466 

Sample,  3,  23,  363,  419;  small,  450 

Scarborough, J. B., 18, 466

Scatter  diagram,  234,  253 

Secrist,  Horace,  53,  250 

Secular  trend,  226 

Selection  (see  Combination),  366 

Semi-logarithmic  paper,  342 

Sheppard's Corrections, 163-167

Shewhart,  W.  A.,  456 

Significant  difference,  409,  443-448 

Significant  figures,  15 

Simple  frequency  distribution,  25 

Skewness,  150-157;  defined,  43,  150; 
measurement  of,  151-157 

Slope  of  a  straight  line,  205 

Snedecor,  G.  W.,  466 

Solomons,  Leonard  M.,  153 

Sorenson,  Herbert,  35,  296,  438,  466 

Standard  deviation,  computation  of, 
126,  127,  129,  130;  defined,  125;  in 
class  frequencies,  125,  423;  of  the 
mean,  144,  430,  453;  of  a  percentage, 
425;  of  the  standard  deviation,  144- 

146,  443,  453 

Standard  error,  of  estimate,  233,  280; 
of  the  mean,  144;  of  the  standard 
deviation,  144 


Standard  unit,  162,  250,  400 

Statistical,  analysis,  2;  constant,  3; 
data,  1;  induction,  364,  420;  in- 
ference, 364;  methods,  1 

Stirling's  Formula,  379 

Straight  line,  fitting  observed  data  to, 
210,  311-315;  intercepts  of,  207; 
properties of, 206; slope of, 205

Summation,  defined,  7;  limits  of,  8; 
theorems  on,  9 

Surface,  F.  M.,  105,  301 

Symmetrical  distributions,  52 

Tabular  presentation,  23,  25 

Tabulation  of  data,  23,  53 

Tallying,  25,  253 

Temporal  distribution,  44,  226 

Tests  of  significance,  409,  446 

Thurstone,  L.  L.,  34,  411 

Time  reversal  test,  197 

Time  series,  43;  fitting  a  straight  line 

to,  226,  342 
Tippett, L. H. C., 449, 465
Treloar,  Alan  E.,  364,  462,  465 
Trend, linear, 203; non-linear, 306
True  class  limits,  26 
Tycho  Brahe,  339 
Tyler,  R.  W.,  Preface 

Unit, class, 70, 71, 72, 128, 161, 245;

standard,  162 
Universe,  3,  23,  363,  419 
Unweighted  index  numbers,  179-184 

Variability, absolute, 113; relative,

118,  122,  131-133 
Variable,  dependent,  6;  independent,  6 
Variance,  126,  442 

Variates,  6;  continuous,  6;  discrete,  6 
Variation,  coefficient  of,  131 

Walker,  H.  M.,  1 
Walsh,  C.  M.,  200 
Watkeys,  C.  W.,  271 
Waugh,  Albert  E.,  412,  465 
Weighted  aggregative  relative,  185 
Weighted  averages,  188-193 
Weighted index numbers, 188, 191,
195-199
Weighted  mean,  67,  90 
Wembridge, H. A., 148
White,  R.  C.,  34 
Whittaker,  E.  T.,  168 
Winfrey,  Robley,  172 
Wolfenden,  H.  H.,  466 

Yoder,  Dale,  35 

Yule,  G.  U.,  1,  69,  145,  465 


